----- Original Message -----
Dave,
When I ran "bt" against a process that was running in user mode, I got
an odd backtrace result:
===8<===
crash> ps
...
> 1324 1223 2 ffff80002018be80 RU 0.0 960 468 dhry
1325 2 1 ffff800021089900 IN 0.0 0 0 [kworker/u16:0]
crash> bt 1324
PID: 1324 TASK: ffff80002018be80 CPU: 2 COMMAND: "dhry"
ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
#0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
#1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
#2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
#3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
pt_regs: ffff800022f6af60
PC: ffffffffffffffff [unknown or invalid address]
LR: ffff800020107ed0 [unknown or invalid address]
SP: 0000000000000000 PSTATE: 004016a4
X29: ffff000008084c4c X28: ffff800022f6b080 X27: ffff000008e60c54
X26: ffff800020107ed0 X25: 0000000000001fff X24: 0000000000000003
X23: ffff0000080815f8 X22: ffff800022f6b040 X21: 0000000000000000
X20: ffff000008bce000 X19: ffff00000808e758 X18: ffff800022f6b010
X17: ffff00000808a820 X16: ffff800022f6aff0 X15: 0000000000000000
X14: 0000000000000000 X13: 0000000000000000 X12: 0000000000402138
X11: ffff000008675850 X10: ffff800022f6afe0 X9: 0000000000000000
X8: ffff800022f6afc0 X7: 0000000000000000 X6: 0000000000000000
X5: 0000000000000000 X4: 0000000000000001 X3: 0000000000000000
X2: 0000000000493000 X1: 0000000000498000 X0: ffffffffffffffff
ORIG_X0: 0000000020000000 SYSCALLNO: 4021f0
bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
pt_regs: ffff800020107ed0
PC: 00000000004016a4 LR: 00000000004016a4 SP: 0000ffffc10c40a0
X29: 0000ffffc10c40a0 X28: 0000000000000000 X27: 0000000000000000
X26: 0000000000000000 X25: 0000000000402138 X24: 00000000004021f0
X23: 0000000000000000 X22: 0000000000000000 X21: 00000000004001a0
X20: 0000000000000000 X19: 0000000000000000 X18: 0000000000000000
X17: 0000000000000001 X16: 0000000000000000 X15: 0000000000493000
X14: 0000000000498000 X13: ffffffffffffffff X12: 0000000000000005
X11: 000000000000001e X10: 0101010101010101 X9: fffffffff59a9190
X8: 7f7f7f7f7f7f7f7f X7: 1f535226301f2b4c X6: 00000003001d1000
X5: 00101d0003000000 X4: 0000000000000000 X3: 4952545320454d4f
X2: 0000000010c35b40 X1: 0000000000000011 X0: 0000000010c35b40
ORIG_X0: 0000000000498700 SYSCALLNO: ffffffffffffffff PSTATE: 20000000
===>8===
* PC, LR and SP look wrong.
I don't know how those pt_regs values were derived.
* The message, "WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp:
ffff800020107ed0 fp: 0 (?)" should be refined.
Apparently, in this case, the process was running in user mode,
and so there is no normal kernel stack.
Support for IRQ stacks was only recently put in place in crash-7.1.5,
and obviously backtraces of tasks that were running in user space at the time
of the crash are not working correctly. Unfortunately, the only test kdump I
have on hand has IRQ stack transitions from kernel space only. I tried to create
a kdump from a system running user-space commands on our 4.5.0-based kernel, but
as luck would have it, kdump fails to work. (It never even reaches the secondary
kernel for some reason, even though the kdump facility says it's functional.)
Obviously there's a problem in arm64_unwind_frame() when it tries to make the
transition: it returns FALSE because the fp is NULL and therefore
INSTACK(frame->fp, bt) fails. The function is trying to emulate the kernel's
unwind_frame() function, which would also return -EINVAL because of the fp. But
I'm not sure whether that fp value has been set correctly, given the first,
seemingly bogus, exception frame that it's showing.
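To make the failure mode concrete, here is a minimal Python sketch of a single
frame-pointer unwind step. It is a toy model, not the code from crash or from
the kernel: the function names, the dictionary standing in for stack memory, and
the stack bounds are all made up for illustration. The only facts it relies on
are that an AArch64 frame record stores the caller's fp at [fp] and the return
address at [fp + 8], and that the walk has to stop when fp is NULL, misaligned,
or outside the current stack.

```python
# Toy model of one arm64 frame-pointer unwind step.
# Every name and value below is hypothetical; this is not crash's
# arm64_unwind_frame() nor the kernel's unwind_frame().

def in_stack(addr, stack_base, stack_top):
    """Rough analogue of an INSTACK()-style range check."""
    return stack_base <= addr < stack_top


def unwind_step(frame, memory, stack_base, stack_top):
    """Return the caller's frame, or None if fp cannot be followed."""
    fp = frame["fp"]
    # A NULL, misaligned, or out-of-stack fp ends the walk. fp == 0 is
    # the case behind the "fp: 0 (?)" warning.
    if (fp % 16 != 0
            or not in_stack(fp, stack_base, stack_top)
            or not in_stack(fp + 8, stack_base, stack_top)):
        return None
    return {
        "sp": fp + 0x10,        # stack pointer just above the frame record
        "fp": memory[fp],       # caller's frame pointer
        "pc": memory[fp + 8],   # return address
    }


# Made-up stack contents: two chained frame records, the last with fp == 0.
toy_memory = {0x1F00: 0x1F40, 0x1F08: 0xAAAA,
              0x1F40: 0x0000, 0x1F48: 0xBBBB}

frame = {"sp": 0x1EF0, "fp": 0x1F00, "pc": 0x9999}
while frame is not None:
    print("sp=%#x fp=%#x pc=%#x" % (frame["sp"], frame["fp"], frame["pc"]))
    frame = unwind_step(frame, toy_memory, 0x1000, 0x2000)
```

Once the chain yields fp == 0, the range check fails on the next step and the
walk stops, which is roughly what the warning in the "dhry" backtrace reports.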
As you have seen, kernel space exceptions look like this, where the fp, sp and pc
values are legitimate, so it prints "--- <IRQ stack> ---", and transitions to the
exception frame on the process stack:
crash> set debug 1
debug: 1
crash> bt
PID: 0 TASK: fffffe035b0aae00 CPU: 3 COMMAND: "swapper/3"
fffffe03fe183d58: fffffe0000137ee4 (crash_save_cpu on IRQ stack)
#0 [fffffe03fe183d60] crash_save_cpu at fffffe0000137ee4
#1 [fffffe03fe183dc0] handle_IPI at fffffe000008e8d4
#2 [fffffe03fe183f80] gic_handle_irq at fffffe00000824c8
#3 [fffffe03fe183fd0] el1_irq at fffffe0000083520
bt: arm64_unwind_frame: switch stacks: fp: fffffe035b0f3f30 sp: fffffe035b0f3e10 pc: fffffe000008611c
--- <IRQ stack> ---
pt_regs: fffffe035b0f3e10
PC: fffffe000008611c [arch_cpu_idle+60]
LR: fffffe0000086118 [arch_cpu_idle+56]
SP: fffffe035b0f3f30 PSTATE: 60000145
X29: fffffe035b0f3f30 X28: 0000000000000000 X27: fffffe0000084170
X26: fffffe0000bf13dc X25: fffffe0000cf4000 X24: fffffe035b0f0000
X23: 0000000000000001 X22: fffffe0000b94c48 X21: 0000000000000003
X20: fffffe0000cf6000 X19: fffffe0000cf6028 X18: 000002aabb090050
X17: 000003ff9131a228 X16: fffffe000026dba4 X15: 00000000000000bf
X14: 004894597490a924 X13: 0000000000000000 X12: 0000000000000010
X11: 0000000000000067 X10: 0000000000000ab0 X9: fffffe035b0f0000
X8: fffffe035b0ab910 X7: 0000000000007b17 X6: 000000000001c690
X5: 0000001515d0302c X4: 0100000000000000 X3: fffffe03fe184c8c
X2: fffffe03fe184c80 X1: 0000000000000000 X0: fffffe035b0f0000
ORIG_X0: fffffe035b0f0000 SYSCALLNO: fffffe0000b94c48
#4 [fffffe035b0f3e10] arch_cpu_idle at fffffe000008611c
#5 [fffffe035b0f3f40] default_idle_call at fffffe00000f81cc
#6 [fffffe035b0f3f70] cpu_startup_entry at fffffe00000f8320
#7 [fffffe035b0f3f80] secondary_start_kernel at fffffe000008e338
crash>
In your sample, it certainly doesn't appear that the first exception frame found
on the IRQ stack is legitimate, and probably should not pass the test in
arm64_is_kernel_exception_frame(), but it does:
crash> bt 1324
PID: 1324 TASK: ffff80002018be80 CPU: 2 COMMAND: "dhry"
ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
#0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
#1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
#2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
#3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
pt_regs: ffff800022f6af60
PC: ffffffffffffffff [unknown or invalid address]
LR: ffff800020107ed0 [unknown or invalid address]
SP: 0000000000000000 PSTATE: 004016a4
X29: ffff000008084c4c X28: ffff800022f6b080 X27: ffff000008e60c54
X26: ffff800020107ed0 X25: 0000000000001fff X24: 0000000000000003
X23: ffff0000080815f8 X22: ffff800022f6b040 X21: 0000000000000000
X20: ffff000008bce000 X19: ffff00000808e758 X18: ffff800022f6b010
X17: ffff00000808a820 X16: ffff800022f6aff0 X15: 0000000000000000
X14: 0000000000000000 X13: 0000000000000000 X12: 0000000000402138
X11: ffff000008675850 X10: ffff800022f6afe0 X9: 0000000000000000
X8: ffff800022f6afc0 X7: 0000000000000000 X6: 0000000000000000
X5: 0000000000000000 X4: 0000000000000001 X3: 0000000000000000
X2: 0000000000493000 X1: 0000000000498000 X0: ffffffffffffffff
ORIG_X0: 0000000020000000 SYSCALLNO: 4021f0
Maybe that is the cause of the bogus "fp"? Anyway, since the orig_sp comes
from a fixed location at the top of the IRQ stack, it then manages to make its
way back to the "dhry" process stack, where this exception frame
"looks" legitimate:
bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
pt_regs: ffff800020107ed0
PC: 00000000004016a4 LR: 00000000004016a4 SP: 0000ffffc10c40a0
X29: 0000ffffc10c40a0 X28: 0000000000000000 X27: 0000000000000000
X26: 0000000000000000 X25: 0000000000402138 X24: 00000000004021f0
X23: 0000000000000000 X22: 0000000000000000 X21: 00000000004001a0
X20: 0000000000000000 X19: 0000000000000000 X18: 0000000000000000
X17: 0000000000000001 X16: 0000000000000000 X15: 0000000000493000
X14: 0000000000498000 X13: ffffffffffffffff X12: 0000000000000005
X11: 000000000000001e X10: 0101010101010101 X9: fffffffff59a9190
X8: 7f7f7f7f7f7f7f7f X7: 1f535226301f2b4c X6: 00000003001d1000
X5: 00101d0003000000 X4: 0000000000000000 X3: 4952545320454d4f
X2: 0000000010c35b40 X1: 0000000000000011 X0: 0000000010c35b40
ORIG_X0: 0000000000498700 SYSCALLNO: ffffffffffffffff PSTATE: 20000000
But I'm not sure what happens when an arm64 IRQ exception occurs when
the task is running in user space. Does it lay an exception frame down on the
process stack and then make the transition? (and therefore the user-space frame
above is legitimate?) Or does the user-space frame get laid down directly on the
IRQ stack? Unfortunately I don't know enough about arm64 exception handling.
In any case, the bt should display "--- <IRQ stack> ---", and then dump
the user-to-kernel-space exception frame, wherever it lies, i.e., either on the
normal process stack or (maybe?) on the IRQ stack.
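One cheap cross-check on the frames quoted in this thread is the mode field in
PSTATE. The sketch below decodes it for the three PSTATE values shown above. The
bit definitions (mode mask 0xf, AArch32 bit 0x10, EL0t = 0x0, EL1t = 0x4,
EL1h = 0x5) follow the arm64 PSR_MODE_* constants; the helper name and the
labels are mine, and this is not a copy of any check that crash performs.

```python
# Decode the exception-level/mode field of an arm64 PSTATE value.
# pstate_mode() is a hypothetical helper written for this illustration.

PSR_MODE_MASK = 0x0000000F   # M[3:0]
PSR_MODE32_BIT = 0x00000010  # set when the exception came from AArch32

MODES = {0x0: "EL0t", 0x4: "EL1t", 0x5: "EL1h"}


def pstate_mode(pstate):
    """Return a short label for the mode encoded in pstate."""
    if pstate & PSR_MODE32_BIT:
        return "AArch32"
    return MODES.get(pstate & PSR_MODE_MASK, "other")


# PSTATE values copied from the pt_regs dumps quoted in this thread.
samples = [
    ("dhry, frame found on IRQ stack", 0x004016A4),
    ("dhry, frame on process stack", 0x20000000),
    ("swapper/3, el1_irq frame", 0x60000145),
]

for label, pstate in samples:
    print("%-32s PSTATE %08x -> %s" % (label, pstate, pstate_mode(pstate)))
```

The process-stack frame decodes as EL0t, which is what a user-space exception
frame should show, and the swapper/3 frame decodes as EL1h. The suspicious
frame on the IRQ stack decodes as EL1t, so a test on the mode bits alone would
not reject it. Note also that its PSTATE value, 004016a4, is the same number as
the user-space PC in the process-stack frame.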
Anyway, can you make the vmlinux/vmcore pair available for me to download? You can
send the details to me offline.
Thanks,
Dave