----- Original Message -----
Takahiro,
Sorry for the previous empty response -- I sent it by mistake while
putting this 3-message collaborative response together.
Dave,
If you don't like my patches, that is OK.
Applying them or not is totally up to you.
I will *never* submit my patches again.
Well, that's certainly unfortunate.
Having said so, I think I would better explain my intentions on the
code.
On Wed, Jun 29, 2016 at 04:09:01PM -0400, Dave Anderson wrote:
>
>
> Hi Takahiro,
>
> I applied patches 1/2 and 2/2 from the v5 patchset. But I can't
> believe the results are what you intended?
>
> For example, taking the 4.6 vmcore that you gave to me, here is the
> current crash utility's output of "bt -a", where the crashing task
> entered crash_kexec() via the sysrq-c page fault exception, and the
> tasks on the other cpus have all entered crash_save_cpu() on their
> IRQ stack as a result of the shutdown IPI, one from user-space and
> the others from the kernel:
>
> crash> bt -a
> PID: 0 TASK: ffff000008dcd900 CPU: 0 COMMAND: "swapper/0"
> #0 [ffff800022f42e50] crash_save_cpu at ffff00000812ae44
> #1 [ffff800022f43010] handle_IPI at ffff00000808e718
> #2 [ffff800022f43040] gic_handle_irq at ffff0000080815f8
> #3 [ffff800022f43080] el1_irq at ffff000008084720
> --- <IRQ stack> ---
This part of dump shown above is the result of "kdump" mechanisms,
and is *not* useful for you to understand what was happening on this
CPU when a crash occurred on the other CPU, #3 in this case.
> PC: ffff0000080857c0 [arch_cpu_idle+16]
> LR: ffff0000080857bc [arch_cpu_idle+12]
> SP: ffff000008dc3f10 PSTATE: 60400149
> X29: ffff000008dc3f10 X28: ffff000008dc0000 X27: 0000000000000000
> X26: ffff000008dc5000 X25: ffff000008dc5c1c X24: ffff000008dc5000
> X23: ffff000008dc0000 X22: ffff000008bd0270 X21: ffff000008dc0000
> X20: ffff000008dc5b88 X19: 0000000000000000 X18: 00000000a632f641
> X17: 0000ffff7da57880 X16: ffff0000081d9838 X15: 00000000383a0a79
> X14: 00000000b2b0b162 X13: 000000005f2cbeec X12: 0000000000045a9e
> X11: ffff8000213bd800 X10: 0000000000000850 X9: ffff000008dc0000
> X8: 000000010000aa07 X7: 000000000000003d X6: 0015752a00000000
> X5: 0100000000000000 X4: 00000000000001c0 X3: 0000000000000000
> X2: 0000000000000001 X1: 0000000000000000 X0: 0000000000000000
> ORIG_X0: 0000000000000000 SYSCALLNO: 7fffffffffffffff
> #4 [ffff000008dc3f10] arch_cpu_idle at ffff0000080857c0
> #5 [ffff000008dc3f20] cpu_startup_entry at ffff0000080f26cc
> #6 [ffff000008dc3f80] rest_init at ffff0000087c7930
> #7 [ffff000008dc3fa0] start_kernel at ffff000008b10b70
All you need to know is that CPU#0 was idle at the crash point.
Whatever else do you want to know?
Things like:
(1) I want to know whether the active cpu was properly shut down by
the kdump mechanism.
(2) I would like to actually "test" that the crash utility's transition from
the IRQ stack back to the process stack works as expected. (especially given
the relative rarity of kernel crashes that actually occur while operating
on IRQ stacks)
(3) There are several other dumping mechanisms that can be utilized on ARM64
systems, so again, I'd like it to be obvious that this is kdump.
(4) When doing a filtered "foreach bt" it would be helpful to simply make it
obvious that a task was an active, non-crashing, task -- with more information
than a "hanging" exception frame dump.
(5) All of the other architectures do it.
... [ cut ] ...
> But why do you think that it is an improvement to leave out the
transition
> to the IRQ stack?
I explained the reason in my previous e-mail.
I still believe that it is an improvement.
OK, we're just going to have to agree to disagree.
> In any case, I haven't even started looking at the "bt
-f" part of the patch
> because quite frankly, this patchset is so complex that I haven't even
> begun to try to understand it. When you said it would be "easier to
> maintain", well, perhaps for *you* maybe, but certainly not for me!
Do you think so?
It is because you are the author.
Maybe so, but I absolutely do think so.
The changes I made were big and look complex to you,
but the resulting code is simple enough for others IMO.
I tried to make the main loop of arm64_back_trace_cmd() simple and
quite resemble the counterpart of dump_backtrace() in the kernel.
From my perspective, the current arm64_back_trace_cmd() is far
simpler,
and is essentially a backport of the kernel's dump_backtrace() code.
The differences come mainly from the facts:
1.you (crash util) need to display not only a one-line frame summary
(with a value of stack/frame pointer), but also a full dump of stack
Right, but the current loop remains basically the same as the kernel's.
2.your current code will miss a *very important* stack frame that is
the one exactly when an interrupt takes place.
I believe that the latter is a big improvement.
I'm not following when you say it "will miss" a frame. It's just
following
the links established by the architecture. If you line up the backtrace output
of the current vs. your patch, there are no missing frames, but rather you
essentially move the "el1_da()" stack frame entry to a lower stack address,
and your "bt -f" implementation doesn't dump the raw exception frame data.
Yeah, due to the changes, arm64_unwind_frame() may get a bit
complicated,
but it is very naive implementation based on the nature of the kernel's
stack usage (or PCS for ARM), the complexity is not the result of my poor
skill of coding.
In other words, the complexity is now *encapsulated* in that function.
I understand what you're saying, and certainly appreciate the depth of
knowledge that you have, but quite frankly I don't find the changes
significant enough to reinvent the wheel.
In addition, I wrote down that function step-by-step, from generic
cases
to exceptional cases, with a bunch of comments for better understandings.
And you will be able to remove such exceptional cases *if* you don't like
them.
Given those stuffs, I think that my code is easier to maintain.
Again, surely up to you.
> Without the kernel's backtrace code from which the current code is based, I
don't
> have anything to work from anymore. I'm really not sure whether the coverage
of
> the "corner cases" you referred to make this effort worth it. I would
like to
> see examples of how the current code fails. But anyway, I will continue to test it
to
> see if there actually is any significant upgrade from what we already have in
> place. From a kernel debugging perspective, all we really need is a basic
unwinder,
> and optionally the full dump of the stack data in between those frames. And the
> current code does do that at a minimum, and there's much to be said for
> simplicity.
>
... [ cut ] ...
> Here is another thing that I would prefer not to change/omit.
>
> In the current code, the raw exception frame data is dumped as
> part of the "bt -[fF]" output, just prior to it being translated
> as an exception frame:
>
> crash> bt -F
> PID: 1223 TASK: ffff800020ef5780 CPU: 3 COMMAND: "sh"
> ... [ cut ] ...
> #5 [ffff800020b0bb70] do_mem_abort at ffff00000808128c
> ffff800020b0bb70: ffff800020b0bd40 el1_da+24
> ffff800020b0bb80: cpu_cgrp_subsys+152 0000000000000063
> ffff800020b0bb90: ffff800020b0bd40 sysrq_handle_crash+32
> ffff800020b0bba0: 0000000000000002 textbuf.34610
> ffff800020b0bbb0: ffff800020b0bbd0 kallsyms_token_index+43128
> ffff800020b0bbc0: 000000000000000f 0000000100000000
> ffff800020b0bbd0: ffff800020b0bc70 vprintk_default+56
> ffff800020b0bbe0: cpu_cgrp_subsys+152 0000000000000063
> ffff800020b0bbf0: sysrq_crash_op 0000000000000009
> ffff800020b0bc00: 0000000000000000 0000000000000015
> ffff800020b0bc10: 0000000000000120 0000000000000040
> ffff800020b0bc20: 0000000000000001 0000000000000000
> ffff800020b0bc30: log_wait+8 0000000000000000
> ffff800020b0bc40: 0000000000000000 00000000000047d4
> ffff800020b0bc50: ffff800022f337a4 0000000000000000
> ffff800020b0bc60: 0000000000000106 0000000000000001
> ffff800020b0bc70: 0000000000000002 0000000000000106
> ffff800020b0bc80: log_buf_len cont
> ffff800020b0bc90: 0000ffff83cc28f0 text.34829+13
> ffff800020b0bca0: sys_write 0000ffff83d266c0
> ffff800020b0bcb0: 0000000000000006 cpu_cgrp_subsys+152
> ffff800020b0bcc0: 0000000000000063 sysrq_crash_op
> ffff800020b0bcd0: 0000000000000009 0000000000000000
> ffff800020b0bce0: 0000000000000015 0000000000000120
> ffff800020b0bcf0: 0000000000000040 sys_call_table
> ffff800020b0bd00: ffff800020b08000 ffff800020b0bd40
> ffff800020b0bd10: sysrq_handle_crash+12 ffff800020b0bd40
> ffff800020b0bd20: sysrq_handle_crash+32 0000000060400149
> ffff800020b0bd30: cpu_cgrp_subsys+152 [kmalloc-1024]
> #6 [ffff800020b0bd40] el1_da at ffff000008084568
> PC: ffff000008457fc8 [sysrq_handle_crash+32]
> LR: ffff000008457fb4 [sysrq_handle_crash+12]
> SP: ffff800020b0bd40 PSTATE: 60400149
> X29: ffff800020b0bd40 X28: ffff800020b08000 X27: ffff0000087e2000
> X26: 0000000000000040 X25: 0000000000000120 X24: 0000000000000015
> X23: 0000000000000000 X22: 0000000000000009 X21: ffff000008e071b0
> X20: 0000000000000063 X19: ffff000008dda000 X18: 0000000000000006
> X17: 0000ffff83d266c0 X16: ffff0000081c68b8 X15: ffff000008e6cc95
> X14: 0000ffff83cc28f0 X13: ffff000008e6c758 X12: ffff000008dda7a0
> X11: 0000000000000106 X10: 0000000000000002 X9: 0000000000000001
> X8: 0000000000000106 X7: 0000000000000000 X6: ffff800022f337a4
> X5: 00000000000047d4 X4: 0000000000000000 X3: 0000000000000000
> X2: ffff000008dda7b8 X1: 0000000000000000 X0: 0000000000000001
> ORIG_X0: ffff000008dda000 SYSCALLNO: ffff80002104d418
> ...
>
> whereas with the v5 patchset, the exception frame only gets translated,
> but the actual raw memory never gets dumped:
I surely remember that you said that would not be an issue
when I submitted older version, maybe v1 or v2.
Well, yes, that's true in the case of an exception frame generated
by the IPI interrupt (or any interrupt), where the eframe contents are
incidental to the interrupt's arrival, and not nearly as interesting
as an in-line kernel exception frame generated by a NULL pointer
reference, BUG(), etc., where the contents of the exception frame
are completely related to the problem/bug at hand. And in that case,
the current code does dump the raw contents just before the register
translation.
Do you think that those symbolic display are still useful
though it is not quite easy to recognize which register has what value?
It's not a major item, but it certainly just dumps the raw memory
in between frame like anywhere else in the "bt -f" output. Seems
strange to just omit it there.
Even more, <ffff800020b0bb80-ffff800020b0bc10> is *not* a stack for
do_mem_abort(). It is just wrong and will confuse people.
So this is another example of improvement on my patches.
I'm not following -- your patch (as well as the current version) both
show ffff800020b0bb80-ffff800020b0bc10 below do_mem_abort():
...
#5 [ffff800020b0bb70] do_mem_abort at ffff000008081288
ffff800020b0bb70: ffff800020b0bd40 el1_da+24
ffff800020b0bb80: cpu_cgrp_subsys+152 0000000000000063
ffff800020b0bb90: ffff800020b0bd40 sysrq_handle_crash+32
ffff800020b0bba0: 0000000000000002 textbuf.34610
ffff800020b0bbb0: ffff800020b0bbd0 kallsyms_token_index+43128
ffff800020b0bbc0: 000000000000000f 0000000100000000
ffff800020b0bbd0: ffff800020b0bc70 vprintk_default+56
ffff800020b0bbe0: cpu_cgrp_subsys+152 0000000000000063
ffff800020b0bbf0: sysrq_crash_op 0000000000000009
ffff800020b0bc00: 0000000000000000 0000000000000015
ffff800020b0bc10: 0000000000000120 0000000000000040
#6 [ffff800020b0bc20] el1_da at ffff000008084564
--- <Exception in kernel> ---
PC: ffff000008457fc8 [sysrq_handle_crash+32]
LR: ffff000008457fb4 [sysrq_handle_crash+12]
SP: ffff800020b0bd40 PSTATE: 60400149
...
The current code does the above, but dumps the raw exception frame
before printing a stack entry for el1_da():
...
#5 [ffff800020b0bb70] do_mem_abort at ffff00000808128c
ffff800020b0bb70: ffff800020b0bd40 el1_da+24
ffff800020b0bb80: cpu_cgrp_subsys+152 0000000000000063
ffff800020b0bb90: ffff800020b0bd40 sysrq_handle_crash+32
ffff800020b0bba0: 0000000000000002 textbuf.34610
ffff800020b0bbb0: ffff800020b0bbd0 kallsyms_token_index+43128
ffff800020b0bbc0: 000000000000000f 0000000100000000
ffff800020b0bbd0: ffff800020b0bc70 vprintk_default+56
ffff800020b0bbe0: cpu_cgrp_subsys+152 0000000000000063
ffff800020b0bbf0: sysrq_crash_op 0000000000000009
ffff800020b0bc00: 0000000000000000 0000000000000015
ffff800020b0bc10: 0000000000000120 0000000000000040
ffff800020b0bc20: 0000000000000001 0000000000000000
ffff800020b0bc30: log_wait+8 0000000000000000
ffff800020b0bc40: 0000000000000000 00000000000047d4
ffff800020b0bc50: ffff800022f337a4 0000000000000000
ffff800020b0bc60: 0000000000000106 0000000000000001
ffff800020b0bc70: 0000000000000002 0000000000000106
ffff800020b0bc80: log_buf_len cont
ffff800020b0bc90: 0000ffff83cc28f0 text.34829+13
ffff800020b0bca0: sys_write 0000ffff83d266c0
ffff800020b0bcb0: 0000000000000006 cpu_cgrp_subsys+152
ffff800020b0bcc0: 0000000000000063 sysrq_crash_op
ffff800020b0bcd0: 0000000000000009 0000000000000000
ffff800020b0bce0: 0000000000000015 0000000000000120
ffff800020b0bcf0: 0000000000000040 sys_call_table
ffff800020b0bd00: ffff800020b08000 ffff800020b0bd40
ffff800020b0bd10: sysrq_handle_crash+12 ffff800020b0bd40
ffff800020b0bd20: sysrq_handle_crash+32 0000000060400149
ffff800020b0bd30: cpu_cgrp_subsys+152 [kmalloc-1024]
#6 [ffff800020b0bd40] el1_da at ffff000008084568
PC: ffff000008457fc8 [sysrq_handle_crash+32]
LR: ffff000008457fb4 [sysrq_handle_crash+12]
SP: ffff800020b0bd40 PSTATE: 60400149
...
Yes, again I understand that you feel that the el1_da() frame
should be at ffff800020b0bc20, but on the other hand, the
link register/pc under do_mem_abort() points to ffff800020b0bd40,
and I kind of prefer to show that. I *know* you disagree.
... [ cut ] ...
IMO I think that you'd better improve the output of
arm64_print_exception_frame() for bt -F.
And this is an totally independent issue from other parts of my patches.
I don't see where "bt -F" gets involved in arm64_print_exception_frame()?
Are you referring to this, which I agree is OK:
@@ -2130,10 +2343,11 @@ arm64_print_exception_frame(struct bt_info *bt, ulong pt_regs, int
mode, FILE *o
}
if (is_64_bit) {
- fprintf(ofp, "ORIG_X0: %016lx SYSCALLNO: %lx",
- (ulong)regs->orig_x0, (ulong)regs->syscallno);
- if (mode == USER_MODE)
+ if (mode == USER_MODE) {
+ fprintf(ofp, "ORIG_X0: %016lx SYSCALLNO: %lx",
+ (ulong)regs->orig_x0, (ulong)regs->syscallno);
fprintf(ofp, " PSTATE: %08lx",
(ulong)regs->pstate);
+ }
fprintf(ofp, "\n");
}
Anyway, I'm going to be away for the next few days, and will revisit
this all again next week.
But here's a thought:
There already are "bt -o" and "bt -O" for x86 and x86_64, which allow
for the existence of coexisting but separate back trace methods.
I'm thinking that we can keep the current code, while also introducing
your patchset as an "optional" back trace method. There would be
arm64_back_trace_cmd_v1() and arm64_back_trace_cmd_v2(), arm64_unwind_frame_v1()
and arm64_unwind_frame_v2(), and any other functions that can't comfortably
be utilized by both methods. And then "bt -o" would use the optional
version, or the default method selection could be toggled with "bt -O".
And again, I hope we don't lose you. During the KASLR negotiations, you've
already seen what an obstinate ass I can be. But it's really part of my
job as a gatekeeper.
Thanks,
Dave
"Simplicity, simplicity, simplicity."
- Henry David Thoreau