Hello Oda-san,
I have a xen-syms vmcore that exercises a path that the
hypervisor-related changes in lkcd_x86_trace.c cannot handle. When
the backtrace reaches the "process_softirqs" text return address,
which comes from "xen/arch/x86/x86_32/entry.S", it cannot go any
further. The backtrace therefore fails, and the recovery code then
incorrectly searches for a (vmlinux) exception frame:
crash> bt -a
PCPU: 0 VCPU: ffbc7080
bt: cannot resolve stack trace:
#0 [ff1d3ebc] elf_core_save_regs at ff10a810
#1 [ff1d3ec4] common_interrupt at ff1222ed
#2 [ff1d3ed0] do_nmi at ff1335bb
#3 [ff1d3ef0] handle_nmi_mce at ff17442e
#4 [ff1d3f24] csched_tick at ff110aa7
#5 [ff1d3f80] timer_softirq_action at ff1155d2
#6 [ff1d3fa0] do_softirq at ff1143fe
#7 [ff1d3fb0] process_softirqs at ff173f61
bt: text symbols on stack:
[ff1d3ebc] disable_local_APIC at ff11db75
[ff1d3ec0] crash_nmi_callback at ff13cc96
[ff1d3ec4] common_interrupt at ff1222f2
[ff1d3ed0] do_nmi at ff1335c1
[ff1d3ef0] handle_nmi_mce at ff174435
[ff1d3f18] csched_tick at ff110aa7
[ff1d3f80] timer_softirq_action at ff1155d4
[ff1d3fa0] do_softirq at ff114405
[ff1d3fb0] process_softirqs at ff173f66
bt: invalid structure size: task_struct
FILE: x86.c LINE: 1576 FUNCTION: x86_eframe_search()
[/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94
813ed94: SIZE_verify+126
810c40c: x86_eframe_search+1075
8164497: handle_trace_error+692
816373b: lkcd_x86_back_trace+2370
bt: invalid structure size: task_struct
FILE: x86.c LINE: 1576 FUNCTION: x86_eframe_search()
crash>
Now, the bogus vmlinux eframe search can be avoided by doing this in
handle_trace_error():
--- lkcd_x86_trace.c.orig 2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c 2008-10-14 16:09:26.000000000 -0400
@@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
back_trace(bt);
- bt->flags = BT_EFRAME_COUNT;
- if ((cnt = machdep->eframe_search(bt))) {
- error(INFO, "possible exception frame%s:\n",
- cnt > 1 ? "s" : "");
- bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
- machdep->eframe_search(bt);
+ if (!XEN_HYPER_MODE()) {
+ bt->flags = BT_EFRAME_COUNT;
+ if ((cnt = machdep->eframe_search(bt))) {
+ error(INFO, "possible exception frame%s:\n",
+ cnt > 1 ? "s" : "");
+ bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
+ machdep->eframe_search(bt);
+ }
}
}
With the above applied, "bt -a" no longer fails prematurely and
continues on to the remaining CPUs:
crash> bt -a
PCPU: 0 VCPU: ffbc7080
bt: cannot resolve stack trace:
#0 [ff1d3ebc] elf_core_save_regs at ff10a810
#1 [ff1d3ec4] common_interrupt at ff1222ed
#2 [ff1d3ed0] do_nmi at ff1335bb
#3 [ff1d3ef0] handle_nmi_mce at ff17442e
#4 [ff1d3f24] csched_tick at ff110aa7
#5 [ff1d3f80] timer_softirq_action at ff1155d2
#6 [ff1d3fa0] do_softirq at ff1143fe
#7 [ff1d3fb0] process_softirqs at ff173f61
bt: text symbols on stack:
[ff1d3ebc] disable_local_APIC at ff11db75
[ff1d3ec0] crash_nmi_callback at ff13cc96
[ff1d3ec4] common_interrupt at ff1222f2
[ff1d3ed0] do_nmi at ff1335c1
[ff1d3ef0] handle_nmi_mce at ff174435
[ff1d3f18] csched_tick at ff110aa7
[ff1d3f80] timer_softirq_action at ff1155d4
[ff1d3fa0] do_softirq at ff114405
[ff1d3fb0] process_softirqs at ff173f66
PCPU: 1 VCPU: ff1b6080
...
Carrying it one step further, the relevant part of the stack from
above looks like this:
crash> rd -s ff1d3ebc 84
ff1d3ebc: disable_local_APIC+5 crash_nmi_callback+38 common_interrupt+82 cpu0_stack+16076
ff1d3ecc: 0003d027 do_nmi+49 cpu0_stack+16120 00000000
ff1d3edc: ffbca000 ffbcbeb0 00000030 cpu0_stack+16308
ff1d3eec: 0000e010 handle_nmi_mce+91 cpu0_stack+16120 00000100
ff1d3efc: 00000005 000000ff 000005dc ffbdee88
ff1d3f0c: 00000000 00000960 00020000 csched_tick+1239
ff1d3f1c: 0000e008 00000083 ffbc7080 00000030
ff1d3f2c: 0003d027 80000003 000583a8 per_cpu__schedule_data
ff1d3f3c: c840ceb2 00000000 ffbfda80 00000000
ff1d3f4c: 00000000 00000000 00000100 00000960
ff1d3f5c: ffbdee80 00000246 000000ff csched_priv+4
ff1d3f6c: 00000000 ffbfda8c __per_cpu_data_end+54972 e4c5d8d9
ff1d3f7c: 0000008b timer_softirq_action+132 00000000 ffbc7080
ff1d3f8c: per_cpu__timers 00000000 cpu0_stack+16308 0000007b
ff1d3f9c: eaed7700 do_softirq+53 00000000 ffbc7080
ff1d3fac: 0000007b process_softirqs+6 eb396d84 00000002
ff1d3fbc: c0678470 c0678470 00000002 eaed7700
ff1d3fcc: 00000000 000d0000 c04011a7 00000061
ff1d3fdc: 00000202 eb396d48 00000069 0000007b
ff1d3fec: 0000007b 00000000 00000000 00000000
ff1d3ffc: ffbc7080 ffffffff ffffffff ffffffff
crash>
Clearly "process_softirqs" is the last text return address
reference that the backtrace code can work with. So to try
to clean up the backtrace, I added this:
--- lkcd_x86_trace.c.orig 2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c 2008-10-14 16:09:26.000000000 -0400
@@ -1423,6 +1423,7 @@ find_trace(
if (XEN_HYPER_MODE()) {
func_name = kl_funcname(pc);
if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
+ || STREQ(func_name, "process_softirqs")
|| STREQ(func_name, "tracing_off")
|| STREQ(func_name, "handle_exception")) {
UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);
which shows:
crash> bt -a
PCPU: 0 VCPU: ffbc7080
#0 [ff1d3ebc] elf_core_save_regs at ff10a810
#1 [ff1d3ec4] common_interrupt at ff1222ed
#2 [ff1d3ed0] do_nmi at ff1335bb
#3 [ff1d3ef0] handle_nmi_mce at ff17442e
#4 [ff1d3f24] csched_tick at ff110aa7
#5 [ff1d3f80] timer_softirq_action at ff1155d2
#6 [ff1d3fa0] do_softirq at ff1143fe
#7 [ff1d3fb0] process_softirqs at ff173f61
PCPU: 1 VCPU: ff1b6080
...
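The terminator test that the second patch extends can be sketched on
its own as follows. This is only an illustration: sk_terminators and
sk_is_terminator() are hypothetical names, strcmp() stands in for the
STREQ() macro, and the list holds just the names visible in the hunk
above.

```c
#include <assert.h>
#include <string.h>

/* Function names at which the hypervisor backtrace is treated as done. */
static const char *const sk_terminators[] = {
    "idle_loop",
    "hypercall",
    "process_softirqs",     /* the name added by the patch */
    "tracing_off",
    "handle_exception",
};

/* Returns 1 if func_name is one of the terminating functions, else 0. */
static int sk_is_terminator(const char *func_name)
{
    size_t i;

    assert(func_name != NULL);
    for (i = 0; i < sizeof(sk_terminators) / sizeof(sk_terminators[0]); i++) {
        if (strcmp(func_name, sk_terminators[i]) == 0)
            return 1;
    }
    return 0;
}
```

With "process_softirqs" in the list, frame #7 ends the trace cleanly
instead of falling into the error path.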
With the second patch applied, the first patch (skipping the eframe
search) is no longer needed for this vmcore, but it seems that it
should be left in place to cover other unforeseen failures in the
future.
Do you agree with these changes?
Thanks,
Dave