Hi Dave,
> Do you agree with these changes?
Yes.
Thank you.
Itsuro Oda
On Tue, 14 Oct 2008 16:30:18 -0400 (EDT)
Dave Anderson <anderson(a)redhat.com> wrote:
>
> Hello Oda-san,
>
> I have a xen-syms vmcore that finds a path that the hypervisor-related
> changes in lkcd_x86_trace.c cannot handle. When the back trace runs
> into the "process_softirqs" text return address reference from
> "xen/arch/x86/x86_32/entry.S", it cannot go any further. Therefore
> the backtrace fails, and in the recovery code it incorrectly searches
> for a (vmlinux) eframe:
>
> crash> bt -a
> PCPU: 0 VCPU: ffbc7080
> bt: cannot resolve stack trace:
> #0 [ff1d3ebc] elf_core_save_regs at ff10a810
> #1 [ff1d3ec4] common_interrupt at ff1222ed
> #2 [ff1d3ed0] do_nmi at ff1335bb
> #3 [ff1d3ef0] handle_nmi_mce at ff17442e
> #4 [ff1d3f24] csched_tick at ff110aa7
> #5 [ff1d3f80] timer_softirq_action at ff1155d2
> #6 [ff1d3fa0] do_softirq at ff1143fe
> #7 [ff1d3fb0] process_softirqs at ff173f61
> bt: text symbols on stack:
> [ff1d3ebc] disable_local_APIC at ff11db75
> [ff1d3ec0] crash_nmi_callback at ff13cc96
> [ff1d3ec4] common_interrupt at ff1222f2
> [ff1d3ed0] do_nmi at ff1335c1
> [ff1d3ef0] handle_nmi_mce at ff174435
> [ff1d3f18] csched_tick at ff110aa7
> [ff1d3f80] timer_softirq_action at ff1155d4
> [ff1d3fa0] do_softirq at ff114405
> [ff1d3fb0] process_softirqs at ff173f66
>
> bt: invalid structure size: task_struct
> FILE: x86.c LINE: 1576 FUNCTION: x86_eframe_search()
>
> [/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94
>
> 813ed94: SIZE_verify+126
> 810c40c: x86_eframe_search+1075
> 8164497: handle_trace_error+692
> 816373b: lkcd_x86_back_trace+2370
>
> bt: invalid structure size: task_struct
> FILE: x86.c LINE: 1576 FUNCTION: x86_eframe_search()
>
> crash>
>
> Now, the bogus vmlinux eframe search can be avoided by doing this in
> handle_trace_error():
>
> --- lkcd_x86_trace.c.orig 2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c 2008-10-14 16:09:26.000000000 -0400
> @@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
> bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
> back_trace(bt);
>
> - bt->flags = BT_EFRAME_COUNT;
> - if ((cnt = machdep->eframe_search(bt))) {
> - error(INFO, "possible exception frame%s:\n",
> - cnt > 1 ? "s" : "");
> - bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> - machdep->eframe_search(bt);
> + if (!XEN_HYPER_MODE()) {
> + bt->flags = BT_EFRAME_COUNT;
> + if ((cnt = machdep->eframe_search(bt))) {
> + error(INFO, "possible exception frame%s:\n",
> + cnt > 1 ? "s" : "");
> + bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> + machdep->eframe_search(bt);
> + }
> }
> }
>
> After doing the above, the bt -a shows this, and therefore does
> not fail prematurely:
>
> crash> bt -a
> PCPU: 0 VCPU: ffbc7080
> bt: cannot resolve stack trace:
> #0 [ff1d3ebc] elf_core_save_regs at ff10a810
> #1 [ff1d3ec4] common_interrupt at ff1222ed
> #2 [ff1d3ed0] do_nmi at ff1335bb
> #3 [ff1d3ef0] handle_nmi_mce at ff17442e
> #4 [ff1d3f24] csched_tick at ff110aa7
> #5 [ff1d3f80] timer_softirq_action at ff1155d2
> #6 [ff1d3fa0] do_softirq at ff1143fe
> #7 [ff1d3fb0] process_softirqs at ff173f61
> bt: text symbols on stack:
> [ff1d3ebc] disable_local_APIC at ff11db75
> [ff1d3ec0] crash_nmi_callback at ff13cc96
> [ff1d3ec4] common_interrupt at ff1222f2
> [ff1d3ed0] do_nmi at ff1335c1
> [ff1d3ef0] handle_nmi_mce at ff174435
> [ff1d3f18] csched_tick at ff110aa7
> [ff1d3f80] timer_softirq_action at ff1155d4
> [ff1d3fa0] do_softirq at ff114405
> [ff1d3fb0] process_softirqs at ff173f66
>
> PCPU: 1 VCPU: ff1b6080
> ...
>
> Carrying it one step further, and given that the relevant part
> of the stack from above looks like this:
>
> crash> rd -s ff1d3ebc 84
> ff1d3ebc: disable_local_APIC+5 crash_nmi_callback+38 common_interrupt+82 cpu0_stack+16076
> ff1d3ecc: 0003d027 do_nmi+49 cpu0_stack+16120 00000000
> ff1d3edc: ffbca000 ffbcbeb0 00000030 cpu0_stack+16308
> ff1d3eec: 0000e010 handle_nmi_mce+91 cpu0_stack+16120 00000100
> ff1d3efc: 00000005 000000ff 000005dc ffbdee88
> ff1d3f0c: 00000000 00000960 00020000 csched_tick+1239
> ff1d3f1c: 0000e008 00000083 ffbc7080 00000030
> ff1d3f2c: 0003d027 80000003 000583a8 per_cpu__schedule_data
> ff1d3f3c: c840ceb2 00000000 ffbfda80 00000000
> ff1d3f4c: 00000000 00000000 00000100 00000960
> ff1d3f5c: ffbdee80 00000246 000000ff csched_priv+4
> ff1d3f6c: 00000000 ffbfda8c __per_cpu_data_end+54972 e4c5d8d9
> ff1d3f7c: 0000008b timer_softirq_action+132 00000000 ffbc7080
> ff1d3f8c: per_cpu__timers 00000000 cpu0_stack+16308 0000007b
> ff1d3f9c: eaed7700 do_softirq+53 00000000 ffbc7080
> ff1d3fac: 0000007b process_softirqs+6 eb396d84 00000002
> ff1d3fbc: c0678470 c0678470 00000002 eaed7700
> ff1d3fcc: 00000000 000d0000 c04011a7 00000061
> ff1d3fdc: 00000202 eb396d48 00000069 0000007b
> ff1d3fec: 0000007b 00000000 00000000 00000000
> ff1d3ffc: ffbc7080 ffffffff ffffffff ffffffff
> crash>
>
> Clearly "process_softirqs" is the last text return address
> reference that the backtrace code can work with. So to try
> to clean up the backtrace, I added this:
>
> --- lkcd_x86_trace.c.orig 2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c 2008-10-14 16:09:26.000000000 -0400
> @@ -1423,6 +1423,7 @@ find_trace(
> if (XEN_HYPER_MODE()) {
> func_name = kl_funcname(pc);
> if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
> + || STREQ(func_name, "process_softirqs")
> || STREQ(func_name, "tracing_off")
> || STREQ(func_name, "handle_exception")) {
> UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);
>
> which shows:
>
> crash> bt -a
> PCPU: 0 VCPU: ffbc7080
> #0 [ff1d3ebc] elf_core_save_regs at ff10a810
> #1 [ff1d3ec4] common_interrupt at ff1222ed
> #2 [ff1d3ed0] do_nmi at ff1335bb
> #3 [ff1d3ef0] handle_nmi_mce at ff17442e
> #4 [ff1d3f24] csched_tick at ff110aa7
> #5 [ff1d3f80] timer_softirq_action at ff1155d2
> #6 [ff1d3fa0] do_softirq at ff1143fe
> #7 [ff1d3fb0] process_softirqs at ff173f61
>
> PCPU: 1 VCPU: ff1b6080
> ...
>
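The effect of the second patch can be sketched in isolation. This is a minimal standalone illustration, not the crash source: `STREQ` is redefined locally as a stand-in for the macro used in the diff, and `terminal_funcs`, `is_terminal_frame()` and `trace_depth()` are hypothetical names introduced only for this example. The function names in the table are the ones listed in the patched find_trace() condition.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for the STREQ() macro used in the diff (assumption). */
#define STREQ(a, b) (strcmp((a), (b)) == 0)

/* Function names that the patched condition treats as the last frame
 * the hypervisor backtrace can work with. */
static const char *const terminal_funcs[] = {
    "idle_loop",
    "hypercall",
    "process_softirqs",   /* added by the second patch */
    "tracing_off",
    "handle_exception",
};

/* Hypothetical helper: return 1 if func_name ends the trace, else 0. */
static int is_terminal_frame(const char *func_name)
{
    size_t i;

    if (func_name == NULL)
        return 0;
    for (i = 0; i < sizeof(terminal_funcs) / sizeof(terminal_funcs[0]); i++) {
        if (STREQ(func_name, terminal_funcs[i]))
            return 1;
    }
    return 0;
}

/* Hypothetical helper: walk a list of frame names from innermost to
 * outermost and return how many frames are reported, stopping at (and
 * including) the first terminal frame. Returns n if none is terminal. */
static size_t trace_depth(const char *const *frames, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (is_terminal_frame(frames[i]))
            return i + 1;
    }
    return n;
}
```

Fed the eight frame names from the PCPU 0 trace above, trace_depth() stops at process_softirqs and reports all eight frames, which mirrors the cleaned-up bt -a output.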
> The second patch alone makes the first one (skipping the eframe search)
> unnecessary for this vmcore, but it seems that the first should be left
> in place for other unforeseen possibilities in the future.
>
> Do you agree with these changes?
>
> Thanks,
> Dave
>
--
Itsuro ODA <oda(a)valinux.co.jp>