On Thursday 19 January 2017 02:05 AM, Dave Anderson wrote:
----- Original Message -----
> Without this patch, backtraces of active tasks may be of the form
> "#0 [c0000000700b3a90] (null) at c0000000700b3b50 (unreliable)" for
> kernel dumps captured with fadump. Try to use the ptregs saved for
> active tasks before falling back to the stack-search method. Also, get
> rid of warnings like "‘is_hugepage’ declared inline after being called".
>
> Signed-off-by: Hari Bathini <hbathini(a)linux.vnet.ibm.com>
Hari,
I only have 1 sample vmcore generated by FADUMP, and I see that
the backtraces of the non-panicking active tasks are an improvement
given that they show the exception frame register set. However, I also
note that the panic task backtrace has changed from this, using the
current method:
PID: 1913 TASK: c000000250472120 CPU: 5 COMMAND: "bash"
#0 [c000000255933620] .crash_fadump at c00000000002cbb8
#1 [c0000002559336c0] .die at c000000000030dc8
#2 [c000000255933770] .bad_page_fault at c000000000043748
#3 [c0000002559337f0] handle_page_fault at c000000000005228
Data Access [300] exception frame:
R0: 0000000000000001 R1: c000000255933ae0 R2: c000000000f27628
R3: 0000000000000063 R4: 0000000000000000 R5: ffffffffffffffff
R6: 0000000000000070 R7: 00000000000020b8 R8: 000000001cbbfaa8
R9: 0000000000000000 R10: 0000000000000002 R11: c00000000039c590
R12: 0000000028242482 R13: c000000000ff3180 R14: 000000001012b3dc
R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
R21: 000000001012b3e4 R22: 0000000000000000 R23: c000000000e57788
R24: 0000000000000004 R25: c000000000e57928 R26: c000000000e37414
R27: 0000000000000000 R28: 0000000000000001 R29: 0000000000000063
R30: c000000000ec9208 R31: c000000001423aac
NIP: c00000000039c57c MSR: 8000000000009032 OR3: c000000255933a20
CTR: c00000000039c560 LR: c00000000039c8c8 XER: 0000000000000001
CCR: 0000000028242482 MQ: 0000000000000000 DAR: 0000000000000000
DSISR: 0000000042000000 Syscall Result: 0000000000000000
#4 [c000000255933ae0] .sysrq_handle_crash at c00000000039c57c
[Link Register] [c000000255933ae0] .__handle_sysrq at c00000000039c8c8
#5 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
#6 [c000000255933c30] .proc_reg_write at c000000000244874
#7 [c000000255933ce0] .vfs_write at c0000000001c9dac
#8 [c000000255933d80] .sys_write at c0000000001c9fd8
#9 [c000000255933e30] syscall_exit at c000000000008564
System Call [c00] exception frame:
R0: 0000000000000004 R1: 00000fffec87b540 R2: 00000080cec13268
R3: 0000000000000001 R4: 00000fffa55a0000 R5: 0000000000000002
R6: 000000007fffffff R7: 0000000000000000 R8: 0000000000000001
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 00000080cea0ce10 R14: 000000001012b3dc
R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
R21: 000000001012b3e4 R22: 000001003391c720 R23: 0000000000000000
R24: 0000000000000001 R25: 000000001012b3e0 R26: 00000fffec87b86c
R27: 00000fffec87b868 R28: 0000000000000002 R29: 00000080cec006a0
R30: 00000fffa55a0000 R31: 0000000000000002
NIP: 00000080ceb49548 MSR: 800000000000d032 OR3: 0000000000000001
CTR: 00000080cead9d50 LR: 00000080cead9db8 XER: 0000000000000000
CCR: 0000000044242424 MQ: 0000000000000001 DAR: 00000100339436b8
DSISR: 0000000042000000 Syscall Result: 0000000000000000
to this with your patch, where the exception backtrace is missing:
PID: 1913 TASK: c000000250472120 CPU: 5 COMMAND: "bash"
R0: 0000000000000001 R1: c000000255933ae0 R2: c000000000f27628
R3: 0000000000000063 R4: 0000000000000000 R5: ffffffffffffffff
R6: 0000000000000070 R7: 00000000000020b8 R8: 000000001cbbfaa8
R9: 0000000000000000 R10: 0000000000000002 R11: c00000000039c590
R12: 0000000028242482 R13: c000000000ff3180 R14: 000000001012b3dc
R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
R21: 000000001012b3e4 R22: 0000000000000000 R23: c000000000e57788
R24: 0000000000000004 R25: c000000000e57928 R26: c000000000e37414
R27: 0000000000000000 R28: 0000000000000001 R29: 0000000000000063
R30: c000000000ec9208 R31: c000000001423aac
NIP: c00000000039c57c MSR: 8000000000009032 OR3: c000000255933a20
CTR: c00000000039c560 LR: c00000000039c8c8 XER: 0000000000000001
CCR: 0000000028242482 MQ: 0000000000000000 DAR: 0000000000000000
DSISR: 0000000042000000 Syscall Result: 0000000000000000
NIP [c00000000039c57c] .sysrq_handle_crash
LR [c00000000039c8c8] .__handle_sysrq
#0 [c000000255933ae0] .__handle_sysrq at c00000000039c89c
#1 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
#2 [c000000255933c30] .proc_reg_write at c000000000244874
#3 [c000000255933ce0] .vfs_write at c0000000001c9dac
#4 [c000000255933d80] .sys_write at c0000000001c9fd8
#5 [c000000255933e30] syscall_exit at c000000000008564
System Call [c00] exception frame:
R0: 0000000000000004 R1: 00000fffec87b540 R2: 00000080cec13268
R3: 0000000000000001 R4: 00000fffa55a0000 R5: 0000000000000002
R6: 000000007fffffff R7: 0000000000000000 R8: 0000000000000001
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 00000080cea0ce10 R14: 000000001012b3dc
R15: 0000000000000000 R16: 0000000000000000 R17: 0000000010129c58
R18: 0000000010129bf8 R19: 000000001012b948 R20: 0000000000000000
R21: 000000001012b3e4 R22: 000001003391c720 R23: 0000000000000000
R24: 0000000000000001 R25: 000000001012b3e0 R26: 00000fffec87b86c
R27: 00000fffec87b868 R28: 0000000000000002 R29: 00000080cec006a0
R30: 00000fffa55a0000 R31: 0000000000000002
NIP: 00000080ceb49548 MSR: 800000000000d032 OR3: 0000000000000001
CTR: 00000080cead9d50 LR: 00000080cead9db8 XER: 0000000000000000
CCR: 0000000044242424 MQ: 0000000000000001 DAR: 00000100339436b8
DSISR: 0000000042000000 Syscall Result: 0000000000000000
And then on a RHEL7 traditional KDUMP dumpfile, both the panic task and the
non-panicking active tasks are missing the exception trace. Here's a sample
panic task backtrace using the current method:
PID: 32696 TASK: c0000001922ed5d0 CPU: 1 COMMAND: "runtest.sh"
#0 [c000000019823610] .crash_kexec at c0000000001725e0
#1 [c000000019823810] .die at c000000000020a48
#2 [c0000000198238c0] .bad_page_fault at c0000000000530d8
#3 [c000000019823940] handle_page_fault at c000000000009584
Data Access [300] exception frame:
R0: c00000000055cf88 R1: c000000019823c30 R2: c00000000130a780
R3: 0000000000000063 R4: c000000001845888 R5: c0000000018564f8
R6: 0000000000005194 R7: c0000000014b99a0 R8: c000000000cca780
R9: 0000000000000001 R10: 0000000000000000 R11: 000000000000012f
R12: 0000000048222842 R13: c000000007b80900 R14: 0000000010142550
R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
R24: 0000000000000001 R25: 0000000000000007 R26: c00000000120b170
R27: 0000000000000063 R28: c000000001709c98 R29: c00000000120b530
R30: c0000000011d8fa0 R31: 0000000000000002
NIP: c00000000055c3f8 MSR: 8000000000009032 OR3: c000000000009358
CTR: c00000000055c3e0 LR: c00000000055cfac XER: 0000000000000001
CCR: 0000000048222822 MQ: 0000000000000000 DAR: 0000000000000000
DSISR: 0000000042000000 Syscall Result: 0000000000000000
#4 [c000000019823c30] .sysrq_handle_crash at c00000000055c3f8
[Link Register] [c000000019823c30] .write_sysrq_trigger at c00000000055cfac
#5 [c000000019823cf0] .proc_reg_write at c00000000037d120
#6 [c000000019823d80] .sys_write at c0000000002d68e4
#7 [c000000019823e30] syscall_exit at c00000000000a17c
System Call [c00] exception frame:
R0: 0000000000000004 R1: 00003fffc7738e00 R2: 00003fffb4163cc0
R3: 0000000000000001 R4: 00003fffad680000 R5: 0000000000000002
R6: 0000000000000010 R7: 0000000000000000 R8: 0000000000000000
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 00003fffb426c330 R14: 0000000010142550
R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
R24: 0000000010143ce0 R25: 00000000100f65d0 R26: 00000100277ffa20
R27: 0000000000000001 R28: 0000000000000002 R29: 00003fffb4151108
R30: 00003fffad680000 R31: 0000000000000002
NIP: 00003fffb408a120 MSR: 800000000280f032 OR3: 0000000000000001
CTR: 0000000000000000 LR: 00003fffb4015704 XER: 0000000000000000
CCR: 0000000048222882 MQ: 0000000000000001 DAR: 00003fffad680000
DSISR: 0000000042000000 Syscall Result: 0000000000000000
And here it is with your patch:
PID: 32696 TASK: c0000001922ed5d0 CPU: 1 COMMAND: "runtest.sh"
R0: c00000000055cf88 R1: c000000019823c30 R2: c00000000130a780
R3: 0000000000000063 R4: c000000001845888 R5: c0000000018564f8
R6: 0000000000005194 R7: c0000000014b99a0 R8: c000000000cca780
R9: 0000000000000001 R10: 0000000000000000 R11: 000000000000012f
R12: 0000000048222842 R13: c000000007b80900 R14: 0000000010142550
R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
R24: 0000000000000001 R25: 0000000000000007 R26: c00000000120b170
R27: 0000000000000063 R28: c000000001709c98 R29: c00000000120b530
R30: c0000000011d8fa0 R31: 0000000000000002
NIP: c00000000055c3f8 MSR: 8000000000009032 OR3: c000000000009358
CTR: c00000000055c3e0 LR: c00000000055cfac XER: 0000000000000001
CCR: 0000000048222822 MQ: 0000000000000000 DAR: 0000000000000000
DSISR: 0000000042000000 Syscall Result: 0000000000000000
NIP [c00000000055c3f8] .sysrq_handle_crash
LR [c00000000055cfac] .write_sysrq_trigger
#0 [c000000019823c30] .write_sysrq_trigger at c00000000055cf88
#1 [c000000019823cf0] .proc_reg_write at c00000000037d120
#2 [c000000019823d80] .sys_write at c0000000002d68e4
#3 [c000000019823e30] syscall_exit at c00000000000a17c
System Call [c00] exception frame:
R0: 0000000000000004 R1: 00003fffc7738e00 R2: 00003fffb4163cc0
R3: 0000000000000001 R4: 00003fffad680000 R5: 0000000000000002
R6: 0000000000000010 R7: 0000000000000000 R8: 0000000000000000
R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000000 R13: 00003fffb426c330 R14: 0000000010142550
R15: 0000000040000000 R16: 0000000010143cdc R17: 0000000000000000
R18: 00000000101306fc R19: 00000000101424dc R20: 00000000101424e0
R21: 000000001013c6f0 R22: 000000001013c970 R23: 0000000000000000
R24: 0000000010143ce0 R25: 00000000100f65d0 R26: 00000100277ffa20
R27: 0000000000000001 R28: 0000000000000002 R29: 00003fffb4151108
R30: 00003fffad680000 R31: 0000000000000002
NIP: 00003fffb408a120 MSR: 800000000280f032 OR3: 0000000000000001
CTR: 0000000000000000 LR: 00003fffb4015704 XER: 0000000000000000
CCR: 0000000048222882 MQ: 0000000000000001 DAR: 00003fffad680000
DSISR: 0000000042000000 Syscall Result: 0000000000000000
And from the same kdump, here's a non-panicking active task with the current
way of doing things:
PID: 0 TASK: c000000001241c00 CPU: 0 COMMAND: "swapper/0"
#0 [c0000001dffdfb90] .crash_ipi_callback at c00000000004fd44
#1 [c0000001dffdfc20] .smp_ipi_demux at c000000000046bf8
#2 [c0000001dffdfcb0] .icp_hv_ipi_action at c000000000073454
#3 [c0000001dffdfd30] .handle_irq_event_percpu at c0000000001afaa4
#4 [c0000001dffdfe10] .handle_percpu_irq at c0000000001b526c
#5 [c0000001dffdfe90] .generic_handle_irq at c0000000001aed1c
#6 [c0000001dffdff10] .__do_irq at c000000000010d44
#7 [c0000001dffdff90] .call_do_irq at c000000000023f60
#8 [c00000000130b7e0] .do_IRQ at c000000000010eec
#9 [c00000000130b880] hardware_interrupt_common at c000000000002614
Hardware Interrupt [501] exception frame:
R0: 0000000000000000 R1: c00000000130bb70 R2: c00000000130a780
R3: 0000000000000000 R4: 0000000000000000 R5: 800000000bb71120
R6: 800000000bb844f8 R7: 0000000000000000 R8: 0000000000000000
R9: 0000000000000040 R10: 0000000000000000 R11: 000000005f9c862a
R12: 0000000000000000 R13: c000000007b80000
NIP: c0000000000849b4 MSR: 8000000000009032 OR3: 0000000000000c00
CTR: 0000000000000000 LR: c000000000710070 XER: 0000000000000000
CCR: 0000000024002084 MQ: 0000000000000001 DAR: c000000001818380
DSISR: c000000000157684 Syscall Result: 0000000000000000
#10 [c00000000130bb70] .plpar_hcall_norets at c0000000000849b4
[Link Register] [c00000000130bb70] .shared_cede_loop at c000000000710070
#11 [c00000000130bbf0] .cpuidle_idle_call at c00000000070d9b4
#12 [c00000000130bcc0] .pseries_lpar_idle at c0000000000872f0
#13 [c00000000130bd30] .arch_cpu_idle at c000000000017b44
#14 [c00000000130bdb0] .cpu_startup_entry at c000000000149b10
#15 [c00000000130be80] .rest_init at c00000000000c5f4
#16 [c00000000130bef0] .start_kernel at c000000000c34258
#17 [c00000000130bf90] start_here_common at c000000000009b6c
and here with your patch applied:
PID: 0 TASK: c000000001241c00 CPU: 0 COMMAND: "swapper/0"
R0: 0000000000000000 R1: c00000000130bb70 R2: c00000000130a780
R3: 0000000000000000 R4: 0000000000000000 R5: 800000000bb71120
R6: 800000000bb844f8 R7: 0000000000000000 R8: 0000000000000000
R9: 0000000000000040 R10: 0000000000000000 R11: 000000005f9c862a
R12: 0000000000000000 R13: c000000007b80000
NIP: c0000000000849b4 MSR: 8000000000009032 OR3: 0000000000000c00
CTR: 0000000000000000 LR: c000000000710070 XER: 0000000000000000
CCR: 0000000024002084 MQ: 0000000000000001 DAR: c000000001818380
DSISR: c000000000157684 Syscall Result: 0000000000000000
NIP [c0000000000849b4] .plpar_hcall_norets
LR [c000000000710070] .shared_cede_loop
#0 [c00000000130bb70] (null) at 3 (unreliable)
#1 [c00000000130bbf0] .cpuidle_idle_call at c00000000070d9b4
#2 [c00000000130bcc0] .pseries_lpar_idle at c0000000000872f0
#3 [c00000000130bd30] .arch_cpu_idle at c000000000017b44
#4 [c00000000130bdb0] .cpu_startup_entry at c000000000149b10
#5 [c00000000130be80] .rest_init at c00000000000c5f4
#6 [c00000000130bef0] .start_kernel at c000000000c34258
#7 [c00000000130bf90] start_here_common at c000000000009b6c
Is that what you really want?
It would be unfortunate to lose all of that exception information, both
for the panic and for all of the non-panicking active tasks.
Hi Dave,
Unfortunate, yes. But I think the exception information we would lose
is related only to crash_ipi_callback, crash_kexec, crash_fadump or
some such, which may not be significant for debugging?
At least, that was the assumption with which I posted this patch.
Thanks
Hari
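For reference, the frames that drop out of the FADUMP panic-task backtrace
can be listed mechanically. The following is only a rough sketch: the names
FRAME_RE, frame_symbols and lost_frames are hypothetical helpers written for
this comparison, not part of crash, and the frame lines are copied from the
two PID 1913 backtraces quoted above.

```python
import re

# Matches numbered "bt" frame lines such as
#   "#3 [c0000002559337f0] handle_page_fault at c000000000005228"
# (hypothetical helper, not part of the crash utility)
FRAME_RE = re.compile(r"^\s*#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)")


def frame_symbols(bt_text):
    """Return the symbol of each numbered frame, in order of appearance."""
    symbols = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            symbols.append(m.group(3))
    return symbols


def lost_frames(before, after):
    """Symbols that are numbered frames in `before` but not in `after`."""
    kept = set(frame_symbols(after))
    return [s for s in frame_symbols(before) if s not in kept]


# Numbered frames of PID 1913 using the current method.
BEFORE = """\
#0 [c000000255933620] .crash_fadump at c00000000002cbb8
#1 [c0000002559336c0] .die at c000000000030dc8
#2 [c000000255933770] .bad_page_fault at c000000000043748
#3 [c0000002559337f0] handle_page_fault at c000000000005228
#4 [c000000255933ae0] .sysrq_handle_crash at c00000000039c57c
#5 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
#6 [c000000255933c30] .proc_reg_write at c000000000244874
#7 [c000000255933ce0] .vfs_write at c0000000001c9dac
#8 [c000000255933d80] .sys_write at c0000000001c9fd8
#9 [c000000255933e30] syscall_exit at c000000000008564
"""

# Numbered frames of PID 1913 with the patch applied.
AFTER = """\
#0 [c000000255933ae0] .__handle_sysrq at c00000000039c89c
#1 [c000000255933ba0] .write_sysrq_trigger at c00000000039ca70
#2 [c000000255933c30] .proc_reg_write at c000000000244874
#3 [c000000255933ce0] .vfs_write at c0000000001c9dac
#4 [c000000255933d80] .sys_write at c0000000001c9fd8
#5 [c000000255933e30] syscall_exit at c000000000008564
"""

print(lost_frames(BEFORE, AFTER))
# ['.crash_fadump', '.die', '.bad_page_fault', 'handle_page_fault', '.sysrq_handle_crash']
```

Note that this compares numbered frames only. With the patch,
.sysrq_handle_crash is still named on the "NIP" line, so the frames that
disappear entirely are the ones from handle_page_fault up to .crash_fadump.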