On Thursday 19 January 2017 07:54 PM, Dave Anderson wrote:
----- Original Message -----
>
> On Thursday 19 January 2017 02:05 AM, Dave Anderson wrote:
>> ----- Original Message -----
>>> Without this patch, backtraces of active tasks may be of the form
>>> "#0 [c0000000700b3a90] (null) at c0000000700b3b50 (unreliable)" for
>>> kernel dumps captured with fadump. Trying to use ptregs saved for
>>> active tasks before falling back to stack-search method. Also, getting
>>> rid of warnings like "‘is_hugepage’ declared inline after being called".
>>>
>>> Signed-off-by: Hari Bathini <hbathini(a)linux.vnet.ibm.com>
>> Hari,
>>
>> I only have 1 sample vmcore generated by FADUMP, and I see that
>> the backtraces of the non-panicking active tasks are an improvement
>> given that they show the exception frame register set. However, I also
>> note that the panic task backtrace has changed, from this using the
>> current method:
>>
>> PID: 1913 TASK: c000000250472120 CPU: 5 COMMAND: "bash"
>> #0 [c000000255933620] .crash_fadump at c00000000002cbb8
>> #1 [c0000002559336c0] .die at c000000000030dc8
>> #2 [c000000255933770] .bad_page_fault at c000000000043748
>> #3 [c0000002559337f0] handle_page_fault at c000000000005228
>> Data Access [300] exception frame:
>> R0: 0000000000000001 R1: c000000255933ae0 R2: c000000000f27628
>> R3: 0000000000000063 R4: 0000000000000000 R5: ffffffffffffffff
>> R6: 0000000000000070 R7: 00000000000020b8 R8: 000000001cbbfaa8
>> R9: 0000000000000000 R10: 0000000000000002 R11: c00000000039c590
>> R12: 0000000028242482 R13: c000000000ff3180 R14: 000000001012b3dc
>> R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
>> R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
>> R21: 000000001012b3e4 R22: 0000000000000000 R23: c000000000e57788
>> R24: 0000000000000004 R25: c000000000e57928 R26: c000000000e37414
>> R27: 0000000000000000 R28: 0000000000000001 R29: 0000000000000063
>> R30: c000000000ec9208 R31: c000000001423aac
>> NIP: c00000000039c57c MSR: 8000000000009032 OR3: c000000255933a20
>> CTR: c00000000039c560 LR: c00000000039c8c8 XER: 0000000000000001
>> CCR: 0000000028242482 MQ: 0000000000000000 DAR: 0000000000000000
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>> #4 [c000000255933ae0] .sysrq_handle_crash at c00000000039c57c
>> [Link Register] [c000000255933ae0] .__handle_sysrq at c00000000039c8c8
>> #5 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
>> #6 [c000000255933c30] .proc_reg_write at c000000000244874
>> #7 [c000000255933ce0] .vfs_write at c0000000001c9dac
>> #8 [c000000255933d80] .sys_write at c0000000001c9fd8
>> #9 [c000000255933e30] syscall_exit at c000000000008564
>> System Call [c00] exception frame:
>> R0: 0000000000000004 R1: 00000fffec87b540 R2: 00000080cec13268
>> R3: 0000000000000001 R4: 00000fffa55a0000 R5: 0000000000000002
>> R6: 000000007fffffff R7: 0000000000000000 R8: 0000000000000001
>> R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
>> R12: 0000000000000000 R13: 00000080cea0ce10 R14: 000000001012b3dc
>> R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
>> R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
>> R21: 000000001012b3e4 R22: 000001003391c720 R23: 0000000000000000
>> R24: 0000000000000001 R25: 000000001012b3e0 R26: 00000fffec87b86c
>> R27: 00000fffec87b868 R28: 0000000000000002 R29: 00000080cec006a0
>> R30: 00000fffa55a0000 R31: 0000000000000002
>> NIP: 00000080ceb49548 MSR: 800000000000d032 OR3: 0000000000000001
>> CTR: 00000080cead9d50 LR: 00000080cead9db8 XER: 0000000000000000
>> CCR: 0000000044242424 MQ: 0000000000000001 DAR: 00000100339436b8
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>>
>> to this with your patch, where the exception backtrace is missing:
>>
>> PID: 1913 TASK: c000000250472120 CPU: 5 COMMAND: "bash"
>> R0: 0000000000000001 R1: c000000255933ae0 R2: c000000000f27628
>> R3: 0000000000000063 R4: 0000000000000000 R5: ffffffffffffffff
>> R6: 0000000000000070 R7: 00000000000020b8 R8: 000000001cbbfaa8
>> R9: 0000000000000000 R10: 0000000000000002 R11: c00000000039c590
>> R12: 0000000028242482 R13: c000000000ff3180 R14: 000000001012b3dc
>> R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
>> R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
>> R21: 000000001012b3e4 R22: 0000000000000000 R23: c000000000e57788
>> R24: 0000000000000004 R25: c000000000e57928 R26: c000000000e37414
>> R27: 0000000000000000 R28: 0000000000000001 R29: 0000000000000063
>> R30: c000000000ec9208 R31: c000000001423aac
>> NIP: c00000000039c57c MSR: 8000000000009032 OR3: c000000255933a20
>> CTR: c00000000039c560 LR: c00000000039c8c8 XER: 0000000000000001
>> CCR: 0000000028242482 MQ: 0000000000000000 DAR: 0000000000000000
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>> NIP [c00000000039c57c] .sysrq_handle_crash
>> LR [c00000000039c8c8] .__handle_sysrq
>> #0 [c000000255933ae0] .__handle_sysrq at c00000000039c89c
>> #1 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
>> #2 [c000000255933c30] .proc_reg_write at c000000000244874
>> #3 [c000000255933ce0] .vfs_write at c0000000001c9dac
>> #4 [c000000255933d80] .sys_write at c0000000001c9fd8
>> #5 [c000000255933e30] syscall_exit at c000000000008564
>> System Call [c00] exception frame:
>> R0: 0000000000000004 R1: 00000fffec87b540 R2: 00000080cec13268
>> R3: 0000000000000001 R4: 00000fffa55a0000 R5: 0000000000000002
>> R6: 000000007fffffff R7: 0000000000000000 R8: 0000000000000001
>> R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
>> R12: 0000000000000000 R13: 00000080cea0ce10 R14: 000000001012b3dc
>> R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
>> R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
>> R21: 000000001012b3e4 R22: 000001003391c720 R23: 0000000000000000
>> R24: 0000000000000001 R25: 000000001012b3e0 R26: 00000fffec87b86c
>> R27: 00000fffec87b868 R28: 0000000000000002 R29: 00000080cec006a0
>> R30: 00000fffa55a0000 R31: 0000000000000002
>> NIP: 00000080ceb49548 MSR: 800000000000d032 OR3: 0000000000000001
>> CTR: 00000080cead9d50 LR: 00000080cead9db8 XER: 0000000000000000
>> CCR: 0000000044242424 MQ: 0000000000000001 DAR: 00000100339436b8
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>>
>>
>>
>> And then on a RHEL7 traditional KDUMP dumpfile, both the panic task and the
>> non-panicking active tasks are missing the exception trace. Here's a sample
>> panic task backtrace using the current method:
>>
>> PID: 32696 TASK: c0000001922ed5d0 CPU: 1 COMMAND: "runtest.sh"
>> #0 [c000000019823610] .crash_kexec at c0000000001725e0
>> #1 [c000000019823810] .die at c000000000020a48
>> #2 [c0000000198238c0] .bad_page_fault at c0000000000530d8
>> #3 [c000000019823940] handle_page_fault at c000000000009584
>> Data Access [300] exception frame:
>> R0: c00000000055cf88 R1: c000000019823c30 R2: c00000000130a780
>> R3: 0000000000000063 R4: c000000001845888 R5: c0000000018564f8
>> R6: 0000000000005194 R7: c0000000014b99a0 R8: c000000000cca780
>> R9: 0000000000000001 R10: 0000000000000000 R11: 000000000000012f
>> R12: 0000000048222842 R13: c000000007b80900 R14: 0000000010142550
>> R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
>> R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
>> R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
>> R24: 0000000000000001 R25: 0000000000000007 R26: c00000000120b170
>> R27: 0000000000000063 R28: c000000001709c98 R29: c00000000120b530
>> R30: c0000000011d8fa0 R31: 0000000000000002
>> NIP: c00000000055c3f8 MSR: 8000000000009032 OR3: c000000000009358
>> CTR: c00000000055c3e0 LR: c00000000055cfac XER: 0000000000000001
>> CCR: 0000000048222822 MQ: 0000000000000000 DAR: 0000000000000000
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>> #4 [c000000019823c30] .sysrq_handle_crash at c00000000055c3f8
>> [Link Register] [c000000019823c30] .write_sysrq_trigger at c00000000055cfac
>> #5 [c000000019823cf0] .proc_reg_write at c00000000037d120
>> #6 [c000000019823d80] .sys_write at c0000000002d68e4
>> #7 [c000000019823e30] syscall_exit at c00000000000a17c
>> System Call [c00] exception frame:
>> R0: 0000000000000004 R1: 00003fffc7738e00 R2: 00003fffb4163cc0
>> R3: 0000000000000001 R4: 00003fffad680000 R5: 0000000000000002
>> R6: 0000000000000010 R7: 0000000000000000 R8: 0000000000000000
>> R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
>> R12: 0000000000000000 R13: 00003fffb426c330 R14: 0000000010142550
>> R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
>> R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
>> R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
>> R24: 0000000010143ce0 R25: 00000000100f65d0 R26: 00000100277ffa20
>> R27: 0000000000000001 R28: 0000000000000002 R29: 00003fffb4151108
>> R30: 00003fffad680000 R31: 0000000000000002
>> NIP: 00003fffb408a120 MSR: 800000000280f032 OR3: 0000000000000001
>> CTR: 0000000000000000 LR: 00003fffb4015704 XER: 0000000000000000
>> CCR: 0000000048222882 MQ: 0000000000000001 DAR: 00003fffad680000
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>>
>> And here it is with your patch:
>>
>> PID: 32696 TASK: c0000001922ed5d0 CPU: 1 COMMAND: "runtest.sh"
>> R0: c00000000055cf88 R1: c000000019823c30 R2: c00000000130a780
>> R3: 0000000000000063 R4: c000000001845888 R5: c0000000018564f8
>> R6: 0000000000005194 R7: c0000000014b99a0 R8: c000000000cca780
>> R9: 0000000000000001 R10: 0000000000000000 R11: 000000000000012f
>> R12: 0000000048222842 R13: c000000007b80900 R14: 0000000010142550
>> R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
>> R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
>> R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
>> R24: 0000000000000001 R25: 0000000000000007 R26: c00000000120b170
>> R27: 0000000000000063 R28: c000000001709c98 R29: c00000000120b530
>> R30: c0000000011d8fa0 R31: 0000000000000002
>> NIP: c00000000055c3f8 MSR: 8000000000009032 OR3: c000000000009358
>> CTR: c00000000055c3e0 LR: c00000000055cfac XER: 0000000000000001
>> CCR: 0000000048222822 MQ: 0000000000000000 DAR: 0000000000000000
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>> NIP [c00000000055c3f8] .sysrq_handle_crash
>> LR [c00000000055cfac] .write_sysrq_trigger
>> #0 [c000000019823c30] .write_sysrq_trigger at c00000000055cf88
>> #1 [c000000019823cf0] .proc_reg_write at c00000000037d120
>> #2 [c000000019823d80] .sys_write at c0000000002d68e4
>> #3 [c000000019823e30] syscall_exit at c00000000000a17c
>> System Call [c00] exception frame:
>> R0: 0000000000000004 R1: 00003fffc7738e00 R2: 00003fffb4163cc0
>> R3: 0000000000000001 R4: 00003fffad680000 R5: 0000000000000002
>> R6: 0000000000000010 R7: 0000000000000000 R8: 0000000000000000
>> R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
>> R12: 0000000000000000 R13: 00003fffb426c330 R14: 0000000010142550
>> R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
>> R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
>> R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
>> R24: 0000000010143ce0 R25: 00000000100f65d0 R26: 00000100277ffa20
>> R27: 0000000000000001 R28: 0000000000000002 R29: 00003fffb4151108
>> R30: 00003fffad680000 R31: 0000000000000002
>> NIP: 00003fffb408a120 MSR: 800000000280f032 OR3: 0000000000000001
>> CTR: 0000000000000000 LR: 00003fffb4015704 XER: 0000000000000000
>> CCR: 0000000048222882 MQ: 0000000000000001 DAR: 00003fffad680000
>> DSISR: 0000000042000000 Syscall Result: 0000000000000000
>>
>> And from the same kdump, here's a non-panicking active task with the
>> current way of doing things:
>>
>> PID: 0 TASK: c000000001241c00 CPU: 0 COMMAND: "swapper/0"
>> #0 [c0000001dffdfb90] .crash_ipi_callback at c00000000004fd44
>> #1 [c0000001dffdfc20] .smp_ipi_demux at c000000000046bf8
>> #2 [c0000001dffdfcb0] .icp_hv_ipi_action at c000000000073454
>> #3 [c0000001dffdfd30] .handle_irq_event_percpu at c0000000001afaa4
>> #4 [c0000001dffdfe10] .handle_percpu_irq at c0000000001b526c
>> #5 [c0000001dffdfe90] .generic_handle_irq at c0000000001aed1c
>> #6 [c0000001dffdff10] .__do_irq at c000000000010d44
>> #7 [c0000001dffdff90] .call_do_irq at c000000000023f60
>> #8 [c00000000130b7e0] .do_IRQ at c000000000010eec
>> #9 [c00000000130b880] hardware_interrupt_common at c000000000002614
>> Hardware Interrupt [501] exception frame:
>> R0: 0000000000000000 R1: c00000000130bb70 R2: c00000000130a780
>> R3: 0000000000000000 R4: 0000000000000000 R5: 800000000bb71120
>> R6: 800000000bb844f8 R7: 0000000000000000 R8: 0000000000000000
>> R9: 0000000000000040 R10: 0000000000000000 R11: 000000005f9c862a
>> R12: 0000000000000000 R13: c000000007b80000
>> NIP: c0000000000849b4 MSR: 8000000000009032 OR3: 0000000000000c00
>> CTR: 0000000000000000 LR: c000000000710070 XER: 0000000000000000
>> CCR: 0000000024002084 MQ: 0000000000000001 DAR: c000000001818380
>> DSISR: c000000000157684 Syscall Result: 0000000000000000
>> #10 [c00000000130bb70] .plpar_hcall_norets at c0000000000849b4
>> [Link Register] [c00000000130bb70] .shared_cede_loop at c000000000710070
>> #11 [c00000000130bbf0] .cpuidle_idle_call at c00000000070d9b4
>> #12 [c00000000130bcc0] .pseries_lpar_idle at c0000000000872f0
>> #13 [c00000000130bd30] .arch_cpu_idle at c000000000017b44
>> #14 [c00000000130bdb0] .cpu_startup_entry at c000000000149b10
>> #15 [c00000000130be80] .rest_init at c00000000000c5f4
>> #16 [c00000000130bef0] .start_kernel at c000000000c34258
>> #17 [c00000000130bf90] start_here_common at c000000000009b6c
>>
>> and here with your patch applied:
>>
>> PID: 0 TASK: c000000001241c00 CPU: 0 COMMAND: "swapper/0"
>> R0: 0000000000000000 R1: c00000000130bb70 R2: c00000000130a780
>> R3: 0000000000000000 R4: 0000000000000000 R5: 800000000bb71120
>> R6: 800000000bb844f8 R7: 0000000000000000 R8: 0000000000000000
>> R9: 0000000000000040 R10: 0000000000000000 R11: 000000005f9c862a
>> R12: 0000000000000000 R13: c000000007b80000
>> NIP: c0000000000849b4 MSR: 8000000000009032 OR3: 0000000000000c00
>> CTR: 0000000000000000 LR: c000000000710070 XER: 0000000000000000
>> CCR: 0000000024002084 MQ: 0000000000000001 DAR: c000000001818380
>> DSISR: c000000000157684 Syscall Result: 0000000000000000
>> NIP [c0000000000849b4] .plpar_hcall_norets
>> LR [c000000000710070] .shared_cede_loop
>> #0 [c00000000130bb70] (null) at 3 (unreliable)
>> #1 [c00000000130bbf0] .cpuidle_idle_call at c00000000070d9b4
>> #2 [c00000000130bcc0] .pseries_lpar_idle at c0000000000872f0
>> #3 [c00000000130bd30] .arch_cpu_idle at c000000000017b44
>> #4 [c00000000130bdb0] .cpu_startup_entry at c000000000149b10
>> #5 [c00000000130be80] .rest_init at c00000000000c5f4
>> #6 [c00000000130bef0] .start_kernel at c000000000c34258
>> #7 [c00000000130bf90] start_here_common at c000000000009b6c
>>
>> Is that what you really want?
>>
>> It would be unfortunate to lose all of that exception information, both
>> for the panic and for all of the non-panicking active tasks.
> Hi Dave,
>
> Unfortunate, yes. But I think the exception information we would lose
> relates to crash_ipi_callback, crash_kexec, crash_fadump or some such,
> which may not be significant for debugging?
> At least, that was the assumption with which I posted this patch.

While it is true in the case of crash IPI callbacks, they are legitimate
parts of the trace, and it's worth "exercising" that backtrace path. Have
you tested a crash that actually occurred while running on the hard or
soft IRQ stack?

Also, the exception frame doesn't even show the [bracketed] type of exception
that occurred -- it's just a register dump followed by the remainder of the
backtrace. Upon a quick glance, it's not obvious that they are even active
tasks. And traditionally, all of the other architectures have always dumped
a full trace.

I'm not sure what the mechanism is for shutting down the non-active
FADUMP tasks, so that's why I asked if you could restrict this change
to just those types of dumps. (For that matter, is it even possible to
differentiate a real kdump from an FADUMP dumpfile -- aside from a

Hi Dave,

Differentiating a kdump dumpfile from an fadump dumpfile is not possible,
except that in the fadump case the stack search would invariably fail and
ptregs are guaranteed to be saved by firmware. I have posted a v2 that
doesn't change bt output for anything but active tasks in the fadump case.

Thanks
Hari
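
For readers following the thread, the behaviour described for v2 (leave bt output alone unless the task is active and the usual stack search comes up empty, as it would for fadump) might be pictured roughly as below. This is only an illustrative sketch in Python, not code from crash or from the patch: `Task`, `BtStart` and `pick_bt_start` are hypothetical names, and the exact order of the checks in v2 is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BtStart(Enum):
    STACK_SEARCH = "stack-search"   # existing method, full exception frames
    SAVED_PTREGS = "saved-ptregs"   # start from the saved register set


@dataclass
class Task:
    pid: int
    is_active: bool                 # was running on a CPU at crash time
    saved_regs: Optional[dict]      # saved register set, if one exists


def pick_bt_start(task: Task, stack_search_found_frame: bool) -> BtStart:
    """Choose where the backtrace of `task` should start (hypothetical)."""
    # Keep the existing output whenever the stack search succeeds, so
    # kdump dumpfiles still show the crash path and exception frames.
    if stack_search_found_frame:
        return BtStart.STACK_SEARCH
    # Fall back to saved ptregs only for active tasks that have them,
    # which is the situation described for fadump dumpfiles.
    if task.is_active and task.saved_regs is not None:
        return BtStart.SAVED_PTREGS
    # Nothing better is available; behave as before.
    return BtStart.STACK_SEARCH
```

With this ordering no explicit kdump-versus-fadump test is needed: the failed stack search itself acts as the indicator, which matches the observation above that the two dumpfile types cannot otherwise be told apart.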