Re: [Crash-utility] arm64: odd backtrace?

Friday, 3 June 2016

On Thu, Jun 02, 2016 at 10:52:28AM -0400, Dave Anderson wrote:
...

 ----- Original Message -----
 > Dave,
 > 
 > When I ran "bt" against a process running in user mode, I got
 > an odd backtrace result:
 > ===8<===
 > crash> ps
 >    ...
 > >  1324   1223   2  ffff80002018be80  RU   0.0     960    468  dhry
 >    1325      2   1  ffff800021089900  IN   0.0       0      0  [kworker/u16:0]
 > crash> bt 1324
 > PID: 1324   TASK: ffff80002018be80  CPU: 2   COMMAND: "dhry"
 > ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
 >  #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
 >  #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
 >  #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
 >  #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
 > pt_regs: ffff800022f6af60
 >      PC: ffffffffffffffff  [unknown or invalid address]
 >      LR: ffff800020107ed0  [unknown or invalid address]
 >      SP: 0000000000000000  PSTATE: 004016a4
 >     X29: ffff000008084c4c  X28: ffff800022f6b080  X27: ffff000008e60c54
 >     X26: ffff800020107ed0  X25: 0000000000001fff  X24: 0000000000000003
 >     X23: ffff0000080815f8  X22: ffff800022f6b040  X21: 0000000000000000
 >     X20: ffff000008bce000  X19: ffff00000808e758  X18: ffff800022f6b010
 >     X17: ffff00000808a820  X16: ffff800022f6aff0  X15: 0000000000000000
 >     X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000402138
 >     X11: ffff000008675850  X10: ffff800022f6afe0   X9: 0000000000000000
 >      X8: ffff800022f6afc0   X7: 0000000000000000   X6: 0000000000000000
 >      X5: 0000000000000000   X4: 0000000000000001   X3: 0000000000000000
 >      X2: 0000000000493000   X1: 0000000000498000   X0: ffffffffffffffff
 >     ORIG_X0: 0000000020000000  SYSCALLNO: 4021f0
 > bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
 > pt_regs: ffff800020107ed0
 >      PC: 00000000004016a4   LR: 00000000004016a4   SP: 0000ffffc10c40a0
 >     X29: 0000ffffc10c40a0  X28: 0000000000000000  X27: 0000000000000000
 >     X26: 0000000000000000  X25: 0000000000402138  X24: 00000000004021f0
 >     X23: 0000000000000000  X22: 0000000000000000  X21: 00000000004001a0
 >     X20: 0000000000000000  X19: 0000000000000000  X18: 0000000000000000
 >     X17: 0000000000000001  X16: 0000000000000000  X15: 0000000000493000
 >     X14: 0000000000498000  X13: ffffffffffffffff  X12: 0000000000000005
 >     X11: 000000000000001e  X10: 0101010101010101   X9: fffffffff59a9190
 >      X8: 7f7f7f7f7f7f7f7f   X7: 1f535226301f2b4c   X6: 00000003001d1000
 >      X5: 00101d0003000000   X4: 0000000000000000   X3: 4952545320454d4f
 >      X2: 0000000010c35b40   X1: 0000000000000011   X0: 0000000010c35b40
 >     ORIG_X0: 0000000000498700  SYSCALLNO: ffffffffffffffff  PSTATE: 20000000
 > ===>8===
 > 
 > * PC, LR and SP look wrong.
 >   I don't know how those pt_regs values were derived.
 > * The message, "WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp:
 >   ffff800020107ed0 fp: 0 (?)" should be refined.
 >   Apparently, in this case, the process is running in user mode,
 >   and so there is no normal kernel stack.

 Support for IRQ stacks was only recently put in place in crash-7.1.5,
 and obviously backtraces for a crash-while-in-user-space task are not working
 correctly.  Unfortunately, the only test kdump I have on hand only has IRQ
 stack transitions from kernel space.  I tried to create a kdump from a system
 running user-space commands on our 4.5.0-based kernel, but as luck would
 have it, kdump fails to work.  (It never even reaches the secondary kernel
 for some reason, even though the kdump facility says it's functional.)

 Obviously there's a problem in arm64_unwind_frame() trying to make the transition:
 it returns FALSE because the fp is NULL, so INSTACK(frame->fp, bt) fails.
 The function is trying to emulate the kernel's unwind_frame() function,
 which would also return -EINVAL because of the fp.  But I'm not sure whether that
 fp value has been set correctly, because of the first, seemingly bogus, exception
 frame that it's showing.
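
A minimal sketch of the frame-pointer check discussed above, written against a toy stack model rather than crash's real data structures. The names stack_model, frame_model, in_stack() and unwind_step() are hypothetical; they only mirror the idea behind INSTACK() and unwind_frame(), namely that a frame record is followed only if its fp lies inside the current stack, so a NULL fp stops the walk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a single stack: an address range plus its contents. */
struct stack_model {
    uint64_t base;       /* lowest address of the stack */
    size_t size;         /* size of the stack in bytes */
    const uint8_t *buf;  /* copy of the stack memory */
};

/* Hypothetical unwinder state, loosely modelled on fp/sp/pc. */
struct frame_model {
    uint64_t fp;
    uint64_t sp;
    uint64_t pc;
};

/* Return 1 if [addr, addr + len) lies entirely inside the stack. */
static int in_stack(const struct stack_model *s, uint64_t addr, size_t len)
{
    if (len > s->size || addr < s->base)
        return 0;
    return addr - s->base <= s->size - len;
}

/* Read a 64-bit value from the stack copy; the caller has checked the range. */
static uint64_t read_u64(const struct stack_model *s, uint64_t addr)
{
    uint64_t val;

    memcpy(&val, s->buf + (addr - s->base), sizeof(val));
    return val;
}

/*
 * Follow one AArch64 frame record: [fp] holds the caller's fp and
 * [fp + 8] holds the return address.  Returns 1 on success, 0 if fp is
 * NULL, misaligned, or outside the stack.
 */
static int unwind_step(const struct stack_model *s, struct frame_model *f)
{
    uint64_t fp = f->fp;

    if (fp == 0 || (fp & 0xf) || !in_stack(s, fp, 16))
        return 0;

    f->sp = fp + 16;
    f->fp = read_u64(s, fp);
    f->pc = read_u64(s, fp + 8);
    return 1;
}
```

With a NULL fp, unwind_step() returns 0 at once, which is the toy equivalent of the FALSE / -EINVAL result described above.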

 As you have seen, kernel space exceptions look like this, where the fp, sp and pc
 values are legitimate, so it prints "--- <IRQ stack> ---", and transitions to the
 exception frame on the process stack:

   crash> set debug 1
   debug: 1
   crash> bt
   PID: 0      TASK: fffffe035b0aae00  CPU: 3   COMMAND: "swapper/3"
   fffffe03fe183d58: fffffe0000137ee4 (crash_save_cpu on IRQ stack)
    #0 [fffffe03fe183d60] crash_save_cpu at fffffe0000137ee4
    #1 [fffffe03fe183dc0] handle_IPI at fffffe000008e8d4
    #2 [fffffe03fe183f80] gic_handle_irq at fffffe00000824c8
    #3 [fffffe03fe183fd0] el1_irq at fffffe0000083520
   bt: arm64_unwind_frame: switch stacks: fp: fffffe035b0f3f30 sp: fffffe035b0f3e10  pc: fffffe000008611c
   --- <IRQ stack> ---
   pt_regs: fffffe035b0f3e10
        PC: fffffe000008611c  [arch_cpu_idle+60]
        LR: fffffe0000086118  [arch_cpu_idle+56]
        SP: fffffe035b0f3f30  PSTATE: 60000145
       X29: fffffe035b0f3f30  X28: 0000000000000000  X27: fffffe0000084170
       X26: fffffe0000bf13dc  X25: fffffe0000cf4000  X24: fffffe035b0f0000
       X23: 0000000000000001  X22: fffffe0000b94c48  X21: 0000000000000003
       X20: fffffe0000cf6000  X19: fffffe0000cf6028  X18: 000002aabb090050
       X17: 000003ff9131a228  X16: fffffe000026dba4  X15: 00000000000000bf
       X14: 004894597490a924  X13: 0000000000000000  X12: 0000000000000010
       X11: 0000000000000067  X10: 0000000000000ab0   X9: fffffe035b0f0000
        X8: fffffe035b0ab910   X7: 0000000000007b17   X6: 000000000001c690
        X5: 0000001515d0302c   X4: 0100000000000000   X3: fffffe03fe184c8c
        X2: fffffe03fe184c80   X1: 0000000000000000   X0: fffffe035b0f0000
       ORIG_X0: fffffe035b0f0000  SYSCALLNO: fffffe0000b94c48
    #4 [fffffe035b0f3e10] arch_cpu_idle at fffffe000008611c
    #5 [fffffe035b0f3f40] default_idle_call at fffffe00000f81cc
    #6 [fffffe035b0f3f70] cpu_startup_entry at fffffe00000f8320
    #7 [fffffe035b0f3f80] secondary_start_kernel at fffffe000008e338
   crash>

 In your sample, it certainly doesn't appear that the first exception frame found
 on the IRQ stack is legitimate, and it probably should not pass the test in
 arm64_is_kernel_exception_frame(), but it does:

 > crash> bt 1324
 > PID: 1324   TASK: ffff80002018be80  CPU: 2   COMMAND: "dhry"
 > ffff800022f6ae08: ffff00000812ae44 (crash_save_cpu on IRQ stack)
 >  #0 [ffff800022f6ae10] crash_save_cpu at ffff00000812ae44
 >  #1 [ffff800022f6ae60] handle_IPI at ffff00000808e718
 >  #2 [ffff800022f6b020] gic_handle_irq at ffff0000080815f8
 >  #3 [ffff800022f6b050] el0_irq_naked at ffff000008084c4c
 > pt_regs: ffff800022f6af60
 >      PC: ffffffffffffffff  [unknown or invalid address]
 >      LR: ffff800020107ed0  [unknown or invalid address]
 >      SP: 0000000000000000  PSTATE: 004016a4
 >     X29: ffff000008084c4c  X28: ffff800022f6b080  X27: ffff000008e60c54
 >     X26: ffff800020107ed0  X25: 0000000000001fff  X24: 0000000000000003
 >     X23: ffff0000080815f8  X22: ffff800022f6b040  X21: 0000000000000000
 >     X20: ffff000008bce000  X19: ffff00000808e758  X18: ffff800022f6b010
 >     X17: ffff00000808a820  X16: ffff800022f6aff0  X15: 0000000000000000
 >     X14: 0000000000000000  X13: 0000000000000000  X12: 0000000000402138
 >     X11: ffff000008675850  X10: ffff800022f6afe0   X9: 0000000000000000
 >      X8: ffff800022f6afc0   X7: 0000000000000000   X6: 0000000000000000
 >      X5: 0000000000000000   X4: 0000000000000001   X3: 0000000000000000
 >      X2: 0000000000493000   X1: 0000000000498000   X0: ffffffffffffffff
 >     ORIG_X0: 0000000020000000  SYSCALLNO: 4021f0
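
As a rough illustration of why a frame like the one above needs more than one check, the sketch below classifies the three PSTATE values quoted in this thread by their mode field. The PSR_MODE_* constants follow the kernel's arm64 ptrace definitions; classify_pstate(), plausible_kernel_frame() and the text bounds passed to it are hypothetical, and this is not the actual arm64_is_kernel_exception_frame() test.

```c
#include <stdint.h>

/* Mode field of PSTATE/SPSR, named as in the arm64 kernel headers. */
#define PSR_MODE_EL0t   0x0ULL
#define PSR_MODE_EL1t   0x4ULL
#define PSR_MODE_EL1h   0x5ULL
#define PSR_MODE32_BIT  0x10ULL
#define PSR_MODE_MASK   0xfULL

/* Hypothetical classification of an exception frame by its PSTATE. */
enum frame_kind { FRAME_USER, FRAME_KERNEL, FRAME_OTHER };

static enum frame_kind classify_pstate(uint64_t pstate)
{
    /* An AArch32 state is outside the scope of this sketch. */
    if (pstate & PSR_MODE32_BIT)
        return FRAME_OTHER;

    switch (pstate & PSR_MODE_MASK) {
    case PSR_MODE_EL0t:
        return FRAME_USER;
    case PSR_MODE_EL1t:
    case PSR_MODE_EL1h:
        return FRAME_KERNEL;
    default:
        return FRAME_OTHER;
    }
}

/*
 * Hypothetical stricter test: kernel mode bits AND a PC inside the
 * kernel text range [text_start, text_end), which the caller supplies.
 */
static int plausible_kernel_frame(uint64_t pstate, uint64_t pc,
                                  uint64_t text_start, uint64_t text_end)
{
    if (classify_pstate(pstate) != FRAME_KERNEL)
        return 0;
    return pc >= text_start && pc < text_end;
}
```

Under these definitions 60000145 decodes as EL1h and 20000000 as EL0t, while the suspicious 004016a4 decodes as EL1t, so the mode bits alone would not reject it. A PC of ffffffffffffffff fails any kernel-text range check, though.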

 Maybe that is the cause of the bogus "fp"?  Anyway, since the orig_sp is
 read from a fixed location at the top of the IRQ stack, it then manages to make its
 way back to the "dhry" process stack, where this exception frame "looks"
 legitimate:

 > bt: WARNING: arm64_unwind_frame: on IRQ stack: oriq_sp: ffff800020107ed0 fp: 0 (?)
 > pt_regs: ffff800020107ed0
 >      PC: 00000000004016a4   LR: 00000000004016a4   SP: 0000ffffc10c40a0
 >     X29: 0000ffffc10c40a0  X28: 0000000000000000  X27: 0000000000000000
 >     X26: 0000000000000000  X25: 0000000000402138  X24: 00000000004021f0
 >     X23: 0000000000000000  X22: 0000000000000000  X21: 00000000004001a0
 >     X20: 0000000000000000  X19: 0000000000000000  X18: 0000000000000000
 >     X17: 0000000000000001  X16: 0000000000000000  X15: 0000000000493000
 >     X14: 0000000000498000  X13: ffffffffffffffff  X12: 0000000000000005
 >     X11: 000000000000001e  X10: 0101010101010101   X9: fffffffff59a9190
 >      X8: 7f7f7f7f7f7f7f7f   X7: 1f535226301f2b4c   X6: 00000003001d1000
 >      X5: 00101d0003000000   X4: 0000000000000000   X3: 4952545320454d4f
 >      X2: 0000000010c35b40   X1: 0000000000000011   X0: 0000000010c35b40
 >     ORIG_X0: 0000000000498700  SYSCALLNO: ffffffffffffffff  PSTATE: 20000000

 But I'm not sure what happens when an arm64 IRQ exception occurs when
 the task is running in user space.  Does it lay an exception frame down on the
 process stack and then make the transition?  (and therefore the user-space frame
 above is legitimate?)  Or does the user-space frame get laid down directly on the 
 IRQ stack?  Unfortunately I don't know enough about arm64 exception handling. 
Since I reviewed this IRQ stack patch on LAKML, I will be able to help you,
but I don't have enough time to explain it in detail this week.

...
 In any case, the bt should display "--- <IRQ stack> ---", and then dump
 the user-to-kernel-space exception frame, wherever it lies, i.e., either on the
 normal process stack or (maybe?) on the IRQ stack.

 Anyway, can you make the vmlinux/vmcore pair available for me to download?  You can
 send the details to me offline. 
I sent you a message which contains the link to those binaries.

Thanks,
-Takahiro AKASHI

...
 Thanks,
   Dave

