On Tue, Jun 14, 2016 at 10:42:03AM -0400, Dave Anderson wrote:
>
> > On Mon, Jun 13, 2016 at 11:30:24AM -0400, Dave Anderson wrote:
> > >
> > > > Dave,
> > > >
> > > > On Fri, Jun 10, 2016 at 04:37:42PM -0400, Dave Anderson wrote:
> > > > > Hi Takahiro,
> > > > >
> > > > > To address my concerns about your patch, I added a few additional
> > > > > changes and attached it to this email. The changes are:
> > > > >
> > > > > (1) Prevent the stack dump "below" the #0 level. Yes, the stack
> > > > >     data region is contained within the incoming frame parameters,
> > > > >     but it's ugly and we really don't care to see what's before
> > > > >     the #0 crash_kexec and crash_save_cpu frames.
> > > > > (2) Fill in the missing stack dump at the top of the process
> > > > >     stack, up to, but not including, the user-space exception frame.
> > > > > (3) Instead of showing the fp of 0 in the top-most frame's stack
> > > > >     address, fill it in with the address of the user-space
> > > > >     exception frame.
> > > > >
> > > > > Note that there is no dump of the stack containing the user-space
> > > > > exception frame, but the register dump itself should suffice.
> > > >
> > > > Well, the essential problem with my patch is that the output from
> > > > "bt -f" looks like:
> > > > #XX ['fp'] 'function' at 'pc'   --- (1)
> > > > <function's stack dump>         --- (2)
> > > > but (1) and (2) are not printed as a single stack frame in the same
> > > > iteration of the while loop in arm64_back_trace_cmd().
> > > > (I hope you understand what I mean :)
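(To make the shape of the problem concrete, here is a rough sketch; this
is not the actual crash source, and unwind_frame(), print_frame_entry(),
and dump_stack_range() are made-up names standing in for whatever crash
really does:)

  struct frame { unsigned long fp, pc; };
  extern int unwind_frame(struct frame *f);                   /* made-up */
  extern void print_frame_entry(int, unsigned long, unsigned long);
  extern void dump_stack_range(unsigned long, unsigned long);

  void back_trace(struct frame frame, int full)
  {
          unsigned long prev_fp = frame.fp;
          int level = 0;

          /*
           * Each iteration prints the current frame's "#XX" header (1),
           * but the stack data dumped for -f (2) spans the previous fp
           * up to the current one -- so a given function's header and
           * its stack data end up being emitted in different iterations.
           */
          while (unwind_frame(&frame) == 0) {
                  print_frame_entry(level++, frame.fp, frame.pc);  /* (1) */
                  if (full)
                          dump_stack_range(prev_fp, frame.fp);     /* (2) */
                  prev_fp = frame.fp;
          }
  }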
> > >
> > > Actually I prefer your first approach. I find this new one confusing,
> > > not to mention unlike any of the other architectures, in that the
> > > "frame level" #X address value is not contiguous with the stack
> > > addresses that get filled in by -f.
> >
> > Can you please elaborate a bit here about "is not contiguous"?
>
> I mean that the #X [address] is not contiguous with the stack addresses
> above and below it. For example:
>
> ffff8003dc103d60: ffff8003dc103dc0 ffff80000028041c
> ffff8003dc103d70: 0000000000000000 0000000000000022
> ffff8003dc103d80: ffff8003db846b00 ffff8003db846b00
> #3 [ffff8003dc103cf0] schedule_hrtimeout_range_clock at ffff8000007786f0
> ffff8003dc103d90: ffff8003dc103dc0 ffff80000028052c
> ffff8003dc103da0: 0000000000000000 0000000000000022
> ffff8003dc103db0: 0000000000000000 0000000000000000
>
> >
> > > Taking your picture into account:
> > >
> > >          stack grows to lower addresses.
> > >               /|\
> > >                |
> > >          |      |
> > > new sp   +------+ <---
> > >          |dyn   |    |
> > >          | vars |    |
> > > new fp   +- - - +    |
> > >          |old fp|    |  a function's stack frame
> > >          |old lr|    |
> > >          |static|    |
> > >          | vars |    |
> > > old sp   +------+ <---
> > >          |dyn   |
> > >          | vars |
> > > old fp   +------+
> > >          |      |
> > >
> > > Your first patch seemed natural to me because for any "#X" line
> > > containing a function name, that function's dynamic variables, the
> > > "old fp/old lr" pair, and the function's static variables were dumped
> > > below it (i.e., at higher stack addresses).
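(As an aside, a minimal sketch of what walking those "old fp/old lr"
records looks like in C; read_frame_record() is a hypothetical helper
standing in for however the dump memory is actually read, not crash's
real API:)

  #include <stdio.h>

  /* On arm64, x29 (fp) points at a two-word record on the stack. */
  struct frame_record {
          unsigned long fp;       /* "old fp" in the diagram above */
          unsigned long lr;       /* "old lr": the return address  */
  };

  /* hypothetical: copy the record at address `fp' out of the dump */
  extern int read_frame_record(unsigned long fp, struct frame_record *rec);

  static void walk(unsigned long fp)
  {
          struct frame_record rec;
          int level = 0;

          while (fp && read_frame_record(fp, &rec)) {
                  printf("#%d [%016lx] pc: %016lx\n", level++, fp, rec.lr);
                  fp = rec.fp;    /* chain moves to higher addresses */
          }
  }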
> > >
> > >
> > > > To be consistent with the output format of x86, the output should be
> > > > <function's stack dump>
> > > > #XX ['fp'] 'function' at 'pc'
> > > > Unfortunately, this requires that arm64_back_trace_cmd() and other
> > > > functions be overhauled.
> > > > Please take a look at my next patch, though it is incomplete and
> > > > still has room for improvement.
> > >
> > > I don't know what you mean by "consistent with the output format of
> > > x86"? With x86_64, each #<level> line is simply the stack address
> > > where the function pushed its return address as a result of its
> > > making a "callq" to the next function. Any local variables of the
> > > calling function would be at the next higher stack addresses:
> > >
> > > ...
> > > #X [stack address] function2 at 'return address'
> > > <function2's local variables>
> > > #Y [stack address] function1 at 'return address'
> > > <function1's local variables>
> > > ...
> > >
> > > So for digging out local stack variables associated with a function,
> > > it's a simple matter of looking "below" it in the "bt -f" output.
> >
> > That is exactly what I meant by "consistent with x86."
> > On x86, the output looks like:
> >
> > <function2's local variables>
> > #X [stack address] function2 at 'return address'
> > <function1's local variables>
> > #Y [stack address] function1 at 'return address'
> > ...
>
> No, that's not true -- look at my #X and #Y description above -- function2's
> local variables are at higher stack addresses than the #X "stack entry"
> address. They have to be -- the callq that pushes the return address at the
> #X stack location is the last stack manipulation that the function does.
> Expressed otherwise, a function's local variables are displayed "below" or
> "underneath" its #X line in the "bt -f" output.
OK
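(For illustration, the standard x86_64 call sequence behind the ordering
Dave describes; the listing is schematic, not taken from any particular
binary:)

  /*
   * Why a function's locals print "below" its #X line in "bt -f":
   *
   *   function1:
   *     push  %rbp
   *     mov   %rsp, %rbp
   *     sub   $0x20, %rsp    # locals live above the new rsp
   *     ...
   *     callq function2      # pushes the return address at the
   *                          # lowest address function1 touches
   *
   * "bt -f" dumps the stack from low to high addresses, so the slot
   * holding that return address (the #X line) prints first, and
   * function1's locals follow underneath it.
   */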
> >
> > So users who are familiar with this format may get confused.
> > (Or am I misunderstanding something?)
> >
> > In addition, my previous patch displays
> > <function2's local variables>
> > #Y [stack address] function1 at 'return address'
> > in arm64_print_stackframe_entry(), and it sounds odd to me.
>
> BTW, the order in which it is done is based upon the kernel's dump_backtrace()
> function, although I'll grant you that the kernel dump is only interested in
> the pc.
Good point. The kernel's dump_backtrace() doesn't show the contents of
the stack at each stack frame. If it did, we would see the same problem.
In fact, precisely backtracing a stack on arm64 is quite difficult,
considering the possible corner cases. For example:
a. The value of the stack pointer is hard to determine.
   It would be a good idea to use a DWARF unwinder.
b. There is no actual stack frame for exception entry code,
   so we need special handling around the entry code.
c. If an interrupt occurs during a function prologue, a stack frame may
   be only partially created or, even worse, never created on the stack
   (see the prologue sketch below).
If you're interested, see my past RFC:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-September/3693...
You may understand why I got so stuck on these backtrace issues.
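(A schematic of case (c), using a typical GCC-generated arm64 non-leaf
prologue; the instruction sequence is standard, though the exact frame
layout varies by compiler and function:)

  /*
   * Typical arm64 non-leaf prologue:
   *
   *   stp  x29, x30, [sp, #-32]!   // (A) push the { fp, lr } record
   *   mov  x29, sp                 // (B) establish the new fp
   *   ...
   *
   * If an interrupt lands between (A) and (B), x29 still points at the
   * caller's record, so a naive fp walk silently skips the interrupted
   * function.  If it lands before (A), no record exists for the
   * function at all.
   */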
> >
> > But, anyhow, it's up to you.
> >
>
> OK! Thanks for giving in... ;-)
Not really yet :)
Using the vmlinux/vmcore binaries that I sent you before, with my v1
(and also Dave's) patch applied, the output is:
crash> bt -f
PID: 1223 TASK: ffff800020ef5780 CPU: 3 COMMAND: "sh"
#0 [ffff800020b0ba70] crash_kexec at ffff00000812b0ac
ffff800020b0ba60: ffffffffffffffff 6533333035642030
ffff800020b0ba70: ffff800020b0ba90 ffff000008088ce8
ffff800020b0ba80: ffff000008e67000 ffff800020b0bc20
...
#4 [ffff800020b0bb60] do_translation_fault at ffff00000809690c
ffff800020b0bb60: ffff800020b0bb70 ffff00000808128c
#5 [ffff800020b0bb70] do_mem_abort at ffff00000808128c
ffff800020b0bb70: ffff800020b0bd40 ffff000008084568
ffff800020b0bb80: ffff000008dda000 0000000000000063
...
ffff800020b0bc00: 0000000000000000 0000000000000015
ffff800020b0bc10: 0000000000000120 0000000000000040
ffff800020b0bc20: 0000000000000001 0000000000000000 <- (1)
ffff800020b0bc30: ffff000008dda7b8 0000000000000000
ffff800020b0bc40: 0000000000000000 00000000000047d4
...
ffff800020b0bd10: ffff000008457fb4 ffff800020b0bd40
ffff800020b0bd20: ffff000008457fc8 0000000060400149
ffff800020b0bd30: ffff000008dda000 ffff80002104d418
#6 [ffff800020b0bd40] el1_da at ffff000008084568
PC: ffff000008457fc8 [sysrq_handle_crash+32]
LR: ffff000008457fb4 [sysrq_handle_crash+12]
SP: ffff800020b0bd40 PSTATE: 60400149
...
ORIG_X0: ffff000008dda000 SYSCALLNO: ffff80002104d418 <- (2)
ffff800020b0bd40: ffff800020b0bd50 ffff000008458644
#7 [ffff800020b0bd50] __handle_sysrq at ffff000008458644 <- (3)
ffff800020b0bd50: ffff800020b0bd90 ffff000008458ac0
ffff800020b0bd60: 0000000000000002 fffffffffffffffb
ffff800020b0bd70: 0000000000000000 ffff800020b0bec8
ffff800020b0bd80: 000000001e9da808 ffff800021ba84a0
...
(1) This is actually the end of the stack frame of #5 (do_mem_abort),
    and also the start of the exception frame of #6 (el1_da).
(2) These values are meaningless for any exception taken in the kernel.
(3) We are missing one stack frame, for "sysrq_handle_crash",
    due to (b) mentioned above.
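(For reference, a sketch of the exception frame behind (2) and (3); the
field order follows the arm64 kernel's struct pt_regs of that era, but
the struct name and the comments here are mine:)

  /*
   * What el1_da saves at #6 is a register dump, not a normal
   * { fp, lr } frame record:
   */
  struct pt_regs_sketch {
          unsigned long regs[31];   /* x0..x30; x29 = fp, x30 = lr    */
          unsigned long sp;
          unsigned long pc;         /* sysrq_handle_crash+32 here     */
          unsigned long pstate;
          unsigned long orig_x0;    /* only meaningful on syscall     */
          unsigned long syscallno;  /* entry, hence the values at (2) */
  };
  /*
   * A walker that resumes from regs[29] (the interrupted fp) finds a
   * record whose saved lr already points into __handle_sysrq, so
   * sysrq_handle_crash shows up only as the exception PC -- the frame
   * missed at (3).
   */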
Thanks,
-Takahiro AKASHI
Notwithstanding the cases above, and given the improvement in readability,
I've checked the patch in as the basis for any future patches you may
propose: