On Thu, Jun 02, 2016 at 10:52:28AM -0400, Dave Anderson wrote:
----- Original Message -----
> Dave,
>
> When I ran "bt" against a process running in a user mode, I got
> an odd backtrace result:
> ===8<===
> crash> ps
> ...
> > 1324 1223 2 ffff80002018be80 RU 0.0 960 468 dhry
> 1325 2 1 ffff800021089900 IN 0.0 0 0 [kworker/u16:0]
> crash> bt 1324
> PID: 1324 TASK: ffff80002018be80 CPU: 2 COMMAND: "dhry"
> ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
> #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
> #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
> #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
> #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
> pt_regs: ffff800022f6af60
> PC: ffffffffffffffff [unknown or invalid address]
> LR: ffff800020107ed0 [unknown or invalid address]
> SP: 0000000000000000 PSTATE: 004016a4
> X29: ffff000008084c4c X28: ffff800022f6b080 X27: ffff000008e60c54
> X26: ffff800020107ed0 X25: 0000000000001fff X24: 0000000000000003
> X23: ffff0000080815f8 X22: ffff800022f6b040 X21: 0000000000000000
> X20: ffff000008bce000 X19: ffff00000808e758 X18: ffff800022f6b010
> X17: ffff00000808a820 X16: ffff800022f6aff0 X15: 0000000000000000
> X14: 0000000000000000 X13: 0000000000000000 X12: 0000000000402138
> X11: ffff000008675850 X10: ffff800022f6afe0 X9: 0000000000000000
> X8: ffff800022f6afc0 X7: 0000000000000000 X6: 0000000000000000
> X5: 0000000000000000 X4: 0000000000000001 X3: 0000000000000000
> X2: 0000000000493000 X1: 0000000000498000 X0: ffffffffffffffff
> ORIG_X0: 0000000020000000 SYSCALLNO: 4021f0
> bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
> pt_regs: ffff800020107ed0
> PC: 00000000004016a4 LR: 00000000004016a4 SP: 0000ffffc10c40a0
> X29: 0000ffffc10c40a0 X28: 0000000000000000 X27: 0000000000000000
> X26: 0000000000000000 X25: 0000000000402138 X24: 00000000004021f0
> X23: 0000000000000000 X22: 0000000000000000 X21: 00000000004001a0
> X20: 0000000000000000 X19: 0000000000000000 X18: 0000000000000000
> X17: 0000000000000001 X16: 0000000000000000 X15: 0000000000493000
> X14: 0000000000498000 X13: ffffffffffffffff X12: 0000000000000005
> X11: 000000000000001e X10: 0101010101010101 X9: fffffffff59a9190
> X8: 7f7f7f7f7f7f7f7f X7: 1f535226301f2b4c X6: 00000003001d1000
> X5: 00101d0003000000 X4: 0000000000000000 X3: 4952545320454d4f
> X2: 0000000010c35b40 X1: 0000000000000011 X0: 0000000010c35b40
> ORIG_X0: 0000000000498700 SYSCALLNO: ffffffffffffffff PSTATE: 20000000
> ===>8===
>
> * PC, LR and SP look wrong.
> I don't know how those pt_regs values were derived.
> * The message, "WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp:
> ffff800020107ed0 fp: 0 (?)" should be refined.
> Apparently, in this case, the process is running in a user mode,
> and so there is no normal kernel stack.
Support for IRQ stacks was only recently put in place in crash-7.1.5,
and obviously backtraces for a crash-while-in-user-space task are not working
correctly. Unfortunately the only test kdump I have on hand only has IRQ
stack transitions from kernel space. I tried to create a kdump from a system
running user-space commands on our 4.5.0-based kernel, but as luck would
have it, kdump fails to work. (It never even reaches the secondary kernel
for some reason, even though the kdump facility says it's functional.)
Obviously there's a problem in arm64_unwind_frame() trying to make the transition:
it returns FALSE because the fp is NULL, and therefore INSTACK(frame->fp, bt)
fails. The function is trying to emulate the kernel's unwind_frame() function,
which also would return -EINVAL because of the fp. But I'm not sure whether that
fp value has been set correctly because of the first, seemingly bogus, exception
frame that it's showing.
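
To illustrate why a NULL fp stops the walk, here is a minimal, self-contained sketch of one frame-pointer unwind step over a simulated stack. It is not the code of arm64_unwind_frame() or of the kernel's unwind_frame(); the names struct sim_stack, in_stack() and unwind_one() are hypothetical, and the only assumption carried over is the usual arm64 frame record layout (saved fp at fp, saved lr at fp + 8).

```c
#include <stdint.h>

#define STACK_WORDS 64

/* Hypothetical unwinder state, for illustration only. */
struct frame {
    uint64_t fp;
    uint64_t sp;
    uint64_t pc;
};

/* A simulated stack: lowest address plus its backing storage. */
struct sim_stack {
    uint64_t base;
    uint64_t words[STACK_WORDS];
};

/* Analogue of an INSTACK()-style range check: a 16-byte frame
 * record starting at addr must lie entirely inside the stack. */
static int in_stack(const struct sim_stack *s, uint64_t addr)
{
    if (addr < s->base)
        return 0;
    return (addr - s->base) <= sizeof(s->words) - 16;
}

static uint64_t read_word(const struct sim_stack *s, uint64_t addr)
{
    return s->words[(addr - s->base) / 8];
}

/* One unwind step. Returns 1 on success, 0 when fp cannot be
 * followed (out of range or not 16-byte aligned). A NULL fp
 * always fails the range check, so the walk stops there. */
static int unwind_one(const struct sim_stack *s, struct frame *f)
{
    uint64_t fp = f->fp;

    if (!in_stack(s, fp) || (fp & 0xf))
        return 0;

    f->sp = fp + 0x10;             /* caller's sp is just above the record */
    f->fp = read_word(s, fp);      /* saved frame pointer */
    f->pc = read_word(s, fp + 8);  /* saved link register */
    return 1;
}
```

With a frame record whose saved fp is 0, the first call to unwind_one() succeeds and leaves fp == 0 in the state; the next call fails the range check, which mirrors the "fp: 0 (?)" situation in the warning quoted above.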
As you have seen, kernel space exceptions look like this, where the fp, sp and pc
values are legitimate, so it prints "--- <IRQ stack> ---", and transitions to the
exception frame on the process stack:
crash> set debug 1
debug: 1
crash> bt
PID: 0 TASK: fffffe035b0aae00 CPU: 3 COMMAND: "swapper/3"
fffffe03fe183d58: fffffe0000137ee4 (crash_save_cpu on IRQ stack)
#0 [fffffe03fe183d60] crash_save_cpu at fffffe0000137ee4
#1 [fffffe03fe183dc0] handle_IPI at fffffe000008e8d4
#2 [fffffe03fe183f80] gic_handle_irq at fffffe00000824c8
#3 [fffffe03fe183fd0] el1_irq at fffffe0000083520
bt: arm64_unwind_frame: switch stacks: fp: fffffe035b0f3f30 sp: fffffe035b0f3e10 pc: fffffe000008611c
--- <IRQ stack> ---
pt_regs: fffffe035b0f3e10
PC: fffffe000008611c [arch_cpu_idle+60]
LR: fffffe0000086118 [arch_cpu_idle+56]
SP: fffffe035b0f3f30 PSTATE: 60000145
X29: fffffe035b0f3f30 X28: 0000000000000000 X27: fffffe0000084170
X26: fffffe0000bf13dc X25: fffffe0000cf4000 X24: fffffe035b0f0000
X23: 0000000000000001 X22: fffffe0000b94c48 X21: 0000000000000003
X20: fffffe0000cf6000 X19: fffffe0000cf6028 X18: 000002aabb090050
X17: 000003ff9131a228 X16: fffffe000026dba4 X15: 00000000000000bf
X14: 004894597490a924 X13: 0000000000000000 X12: 0000000000000010
X11: 0000000000000067 X10: 0000000000000ab0 X9: fffffe035b0f0000
X8: fffffe035b0ab910 X7: 0000000000007b17 X6: 000000000001c690
X5: 0000001515d0302c X4: 0100000000000000 X3: fffffe03fe184c8c
X2: fffffe03fe184c80 X1: 0000000000000000 X0: fffffe035b0f0000
ORIG_X0: fffffe035b0f0000 SYSCALLNO: fffffe0000b94c48
#4 [fffffe035b0f3e10] arch_cpu_idle at fffffe000008611c
#5 [fffffe035b0f3f40] default_idle_call at fffffe00000f81cc
#6 [fffffe035b0f3f70] cpu_startup_entry at fffffe00000f8320
#7 [fffffe035b0f3f80] secondary_start_kernel at fffffe000008e338
crash>
In your sample, it certainly doesn't appear that the first exception frame found
on the IRQ stack is legitimate, and probably should not pass the test in
arm64_is_kernel_exception_frame(), but it does:
> crash> bt 1324
> PID: 1324 TASK: ffff80002018be80 CPU: 2 COMMAND: "dhry"
> ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
> #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
> #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
> #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
> #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
> pt_regs: ffff800022f6af60
> PC: ffffffffffffffff [unknown or invalid address]
> LR: ffff800020107ed0 [unknown or invalid address]
> SP: 0000000000000000 PSTATE: 004016a4
> X29: ffff000008084c4c X28: ffff800022f6b080 X27: ffff000008e60c54
> X26: ffff800020107ed0 X25: 0000000000001fff X24: 0000000000000003
> X23: ffff0000080815f8 X22: ffff800022f6b040 X21: 0000000000000000
> X20: ffff000008bce000 X19: ffff00000808e758 X18: ffff800022f6b010
> X17: ffff00000808a820 X16: ffff800022f6aff0 X15: 0000000000000000
> X14: 0000000000000000 X13: 0000000000000000 X12: 0000000000402138
> X11: ffff000008675850 X10: ffff800022f6afe0 X9: 0000000000000000
> X8: ffff800022f6afc0 X7: 0000000000000000 X6: 0000000000000000
> X5: 0000000000000000 X4: 0000000000000001 X3: 0000000000000000
> X2: 0000000000493000 X1: 0000000000498000 X0: ffffffffffffffff
> ORIG_X0: 0000000020000000 SYSCALLNO: 4021f0
Maybe that is the cause of the bogus "fp"? Anyway, since the orig_sp is
from a fixed location at the top of the IRQ stack, it then manages to make its
way back to the "dhry" process stack, where this exception frame
"looks" legitimate:
> bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
> pt_regs: ffff800020107ed0
> PC: 00000000004016a4 LR: 00000000004016a4 SP: 0000ffffc10c40a0
> X29: 0000ffffc10c40a0 X28: 0000000000000000 X27: 0000000000000000
> X26: 0000000000000000 X25: 0000000000402138 X24: 00000000004021f0
> X23: 0000000000000000 X22: 0000000000000000 X21: 00000000004001a0
> X20: 0000000000000000 X19: 0000000000000000 X18: 0000000000000000
> X17: 0000000000000001 X16: 0000000000000000 X15: 0000000000493000
> X14: 0000000000498000 X13: ffffffffffffffff X12: 0000000000000005
> X11: 000000000000001e X10: 0101010101010101 X9: fffffffff59a9190
> X8: 7f7f7f7f7f7f7f7f X7: 1f535226301f2b4c X6: 00000003001d1000
> X5: 00101d0003000000 X4: 0000000000000000 X3: 4952545320454d4f
> X2: 0000000010c35b40 X1: 0000000000000011 X0: 0000000010c35b40
> ORIG_X0: 0000000000498700 SYSCALLNO: ffffffffffffffff PSTATE: 20000000
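
One way to judge which of the dumped frames describes user-space state is to decode the mode field of PSTATE. The sketch below assumes the standard AArch64 encoding of the low four PSTATE bits (0x0 for EL0t, 0x4 for EL1t, 0x5 for EL1h); the helper names pstate_mode() and pstate_is_user() are hypothetical and are not taken from crash.

```c
#include <stdint.h>

/* AArch64 PSTATE mode field encodings (low four bits). */
#define PSR_MODE_MASK  0x0000000fULL
#define PSR_MODE_EL0t  0x0ULL
#define PSR_MODE_EL1t  0x4ULL
#define PSR_MODE_EL1h  0x5ULL

/* Hypothetical helper: extract the mode field from a PSTATE value. */
static uint64_t pstate_mode(uint64_t pstate)
{
    return pstate & PSR_MODE_MASK;
}

/* Hypothetical helper: nonzero if the saved state was running at EL0. */
static int pstate_is_user(uint64_t pstate)
{
    return pstate_mode(pstate) == PSR_MODE_EL0t;
}
```

Applied to the values in this thread: 20000000 (the frame at ffff800020107ed0) decodes as EL0t, i.e. user mode; 60000145 (the swapper/3 frame) decodes as EL1h; and 004016a4 (the frame at ffff800022f6af60) decodes as EL1t. That last value is identical to the user-space PC shown in the later frame, which may hint that the first "frame" is other stack data being read as a pt_regs.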
But I'm not sure what happens when an arm64 IRQ exception occurs when
the task is running in user space. Does it lay an exception frame down on the
process stack and then make the transition? (and therefore the user-space frame
above is legitimate?) Or does the user-space frame get laid down directly on the
IRQ stack? Unfortunately I don't know enough about arm64 exception handling.
Since I reviewed this IRQ stack patch in LAK-ML, I will be able to help you,
but I don't have enough time to explain in detail this week.
In any case, the bt should display "--- <IRQ stack> ---", and then dump
the user-to-kernel-space exception frame, wherever it lies, i.e., either on the
normal process stack or (maybe?) on the IRQ stack.
Anyway, can you make the vmlinux/vmcore pair available for me to download? You can
send the details to me offline.
I sent you a message which contains the link to those binaries.
Thanks,
-Takahiro AKASHI
Thanks,
Dave
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
--
Thanks,
-Takahiro AKASHI