On Tue, Jun 14, 2016 at 10:42:03AM -0400, Dave Anderson wrote:
> On Mon, Jun 13, 2016 at 11:30:24AM -0400, Dave Anderson wrote:
> >
> > > Dave,
> > >
> > > On Fri, Jun 10, 2016 at 04:37:42PM -0400, Dave Anderson wrote:
> > > > Hi Takahiro,
> > > >
> > > > To address my concerns about your patch, I added a few additional
> > > > changes and attached
> > > > it to this email. The changes are:
> > > >
> > > > (1) Prevent the stack dump "below" the #0 level. Yes, the
stack data
> > > > region is contained within
> > > > the incoming frame parameters, but it's ugly and we really
don't
> > > > care to see what's before
> > > > the #0 crash_kexec and crash_save_cpu #0 frames.
> > > > (2) Fill in the missing stack dump at the top of the process stack,
up
> > > > to, but not including
> > > > the user-space exception frame.
> > > > (3) Instead of showing the fp of 0 in the top-most frame's stack
> > > > address, fill it in with the
> > > > address of the user-space exception frame.
> > > >
> > > > Note that there is no dump of the stack containing the user-space
> > > > exception frame, but the
> > > > register dump itself should suffice.
> > >
> > > Well, the essential problem with my patch is that the output from "bt
-f"
> > > looks like:
> > > #XX ['fp'] 'function' at 'pc' --- (1)
> > > <function's stack dump> --- (2)
> > > but that (1) and (2) are not printed as a single stack frame in the same
> > > iteration of while loop in arm64_back_trace_cmd().
> > > (I hope you understand what I mean :)
> >
> > Actually I prefer your first approach. I find this new one confusing, not
> > to mention unlike any of the other architectures in that the "frame
level"
> > #X address value is not contiguous with the stack addresses that get filled
> > in by -f.
>
> Can you please elaborate a bit here about "is not contiguous"?
I mean that the #X [address] is not contiguous with the stack addresses
above and below it. For example:
ffff8003dc103d60: ffff8003dc103dc0 ffff80000028041c
ffff8003dc103d70: 0000000000000000 0000000000000022
ffff8003dc103d80: ffff8003db846b00 ffff8003db846b00
#3 [ffff8003dc103cf0] schedule_hrtimeout_range_clock at ffff8000007786f0
ffff8003dc103d90: ffff8003dc103dc0 ffff80000028052c
ffff8003dc103da0: 0000000000000000 0000000000000022
ffff8003dc103db0: 0000000000000000 0000000000000000
>
> > Taking your picture into account:
> >
> > stack grows to lower addresses.
> > /|\
> > |
> > | |
> > new sp +------+ <---
> > |dyn | |
> > | vars | |
> > new fp +- - - + |
> > |old fp| | a function's stack frame
> > |old lr| |
> > |static| |
> > | vars| |
> > old sp +------+ <---
> > |dyn |
> > | vars |
> > old fp +------+
> > | |
> >
> > Your first patch seemed natural to me because for any "#X" line
containing a function
> > name, that function's dynamic variables, the "old fp/old lr"
pair, and the function's
> > static variables were dumped below it (i.e., at higher stack addresses).
> >
> >
> > > To be consistent with the out format of x86, the output should be
> > > <function's stack dump>
> > > #XX ['fp'] 'function' at 'pc'
> > >
> > > Unfortunately, this requires that arm64_back_trace_cmd() and other
functions should be overhauled.
> > > Please take a look at my next patch though it is uncompleted and still has
room for improvement.
> >
> > I don't know what you mean by "consistent with the out format of
x86"? With x86_64,
> > each #<level> line is simply the stack address where the function pushed
its return
> > address as a result of its making a "callq" to the next function.
Any local variables of
> > the calling function would be at the next higher stack addresses:
> >
> > ...
> > #X [stack address] function2 at 'return address'
> > <function2's local variables>
> > #Y [stack address] function1 at 'return address'
> > <functions1's local variables>
> > ...
> >
> > So for digging out local stack variables associated with a function, it's a
simple
> > matter of looking "below" it in the "bt -f" output.
>
> That is exactly what I meant by "consistent with x86."
> On x86, the output looks like:
>
> <function2's local variables>
> #X [stack address] function2 at 'return address'
> <functions1's local variables>
> #Y [stack address] function1 at 'return address'
> ...
No, that's not true -- look at my #X and #Y description above -- funcion2's
local
variables are at higher stack addresses than the #X "stack entry" address.
They
have to be -- the callq that pushes the return address at the #X stack location is
the last stack manipulation that the function does. Expressed otherwise, a
function's
local variables are displayed "below" or "underneath" its #X line in
the "bt -f"
output.
OK
>
> So users who are familiar with this format may get confused.
> (Or do I misunderstand anything?)
>
> In addition, my previous patch displays
> <function2's local variables>
> #Y [stack address] function1 at 'return address'
> in arm64_print_stackframe_entry(), and it sounds odd to me.
BTW, the order in which it is done is based upon the kernel's dump_backtrace()
function, although I'll grant you that the kernel dump is only interested in
the pc.
Good point. The kernel's dump_backtrace() doesn't show the contents of
stack at each stack frame. If it did, we would see the same problem.
In fact, backtracing a stack on arm64 precisely is a quite difficult
thinking of possible corner cases. For example,
a. A value of stack pointer is hard to determine
It would be a good idea to use a dwarf unwinder.
b. No actual stack frame for an exception entry code
So we need a special handling around the entry code
c. If an interrupt occurs during a function prologue, a stack frame
can partially or even worse never be created on a stack.
If you're interested, see my past RFC:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-September/3693...
You may understand why I get so stuck with backtrace issues.
>
> But, anyhow, it's up to you.
>
OK! Thanks for giving in... ;-)
Not really yet :)
From the vmlinux/vmcore binaries that I sent you before,
with my
v1 (and also Dave's) patch,
crash> bt -f
PID: 1223 TASK: ffff800020ef5780 CPU: 3 COMMAND: "sh"
#0 [ffff800020b0ba70] crash_kexec at ffff00000812b0ac
ffff800020b0ba60: ffffffffffffffff 6533333035642030
ffff800020b0ba70: ffff800020b0ba90 ffff000008088ce8
ffff800020b0ba80: ffff000008e67000 ffff800020b0bc20
...
#4 [ffff800020b0bb60] do_translation_fault at ffff00000809690c
ffff800020b0bb60: ffff800020b0bb70 ffff00000808128c
#5 [ffff800020b0bb70] do_mem_abort at ffff00000808128c
ffff800020b0bb70: ffff800020b0bd40 ffff000008084568
ffff800020b0bb80: ffff000008dda000 0000000000000063
...
ffff800020b0bc00: 0000000000000000 0000000000000015
ffff800020b0bc10: 0000000000000120 0000000000000040
ffff800020b0bc20: 0000000000000001 0000000000000000 <- (1)
ffff800020b0bc30: ffff000008dda7b8 0000000000000000
ffff800020b0bc40: 0000000000000000 00000000000047d4
...
ffff800020b0bd10: ffff000008457fb4 ffff800020b0bd40
ffff800020b0bd20: ffff000008457fc8 0000000060400149
ffff800020b0bd30: ffff000008dda000 ffff80002104d418
#6 [ffff800020b0bd40] el1_da at ffff000008084568
PC: ffff000008457fc8 [sysrq_handle_crash+32]
LR: ffff000008457fb4 [sysrq_handle_crash+12]
SP: ffff800020b0bd40 PSTATE: 60400149
...
ORIG_X0: ffff000008dda000 SYSCALLNO: ffff80002104d418 <- (2)
ffff800020b0bd40: ffff800020b0bd50 ffff000008458644
#7 [ffff800020b0bd50] __handle_sysrq at ffff000008458644 <- (3)
ffff800020b0bd50: ffff800020b0bd90 ffff000008458ac0
ffff800020b0bd60: 0000000000000002 fffffffffffffffb
ffff800020b0bd70: 0000000000000000 ffff800020b0bec8
ffff800020b0bd80: 000000001e9da808 ffff800021ba84a0
...
(1) This is actually an end of stack frame of #5 (do_mem_abort),
and also a start of exception frame of #6 (el1_da).
(2) Those are meaningless for any exceptions in the kernel.
(3) We miss one stack frame for "sysrq_handle_crash"
due to (b) I mentioned above.
Thanks,
-Takahiro AKASHI
Thanks,
Dave
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility