Dave,
Adding --no_elf_notes to the crash invocation does indeed start crash
without issue. Do you think that I am dealing with a
corrupted/incomplete vmcore (as evidenced by that extremely large n_descsz
value), or is this a bug that crash could handle more gracefully?
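On the second question, here is a minimal sketch of what more graceful
handling could look like: check that a note's name and descriptor sizes
fit in the bytes actually available before reading registers out of it.
This is not crash code. The struct mirrors the three 32-bit fields that
gdb prints for Elf64_Nhdr in the quoted message below, and note_hdr,
note_align4() and elf_note_is_sane() are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for Elf64_Nhdr: three 32-bit words. */
struct note_hdr {
    uint32_t n_namesz;  /* length of the note name, in bytes */
    uint32_t n_descsz;  /* length of the note descriptor, in bytes */
    uint32_t n_type;    /* note type */
};

/* Round a size up to the 4-byte padding commonly used between note fields. */
uint64_t note_align4(uint64_t n)
{
    return (n + 3) & ~(uint64_t)3;
}

/*
 * Hypothetical helper (an assumption, not a crash function).
 * Returns 1 if the header at `note`, plus its padded name and descriptor,
 * fits within the `avail` bytes left in the buffer and the descriptor is
 * at least `min_desc` bytes long. Returns 0 otherwise.
 */
int elf_note_is_sane(const void *note, size_t avail, size_t min_desc)
{
    struct note_hdr hdr;
    uint64_t need;

    assert(note != NULL);

    if (avail < sizeof(hdr))
        return 0;

    /* Copy the header out, since the note pointer may be unaligned. */
    memcpy(&hdr, note, sizeof(hdr));

    /* 64-bit arithmetic, so two 32-bit sizes cannot overflow the sum. */
    need = sizeof(hdr) + note_align4(hdr.n_namesz) + note_align4(hdr.n_descsz);
    if (need > avail)
        return 0;

    if (hdr.n_descsz < min_desc)
        return 0;

    return 1;
}
```

With the header values from this vmcore (n_namesz = 8, n_descsz =
3438804992), a check like this would reject the note for any
realistically sized note buffer, and the caller could then fall back to
the traditional method instead of faulting.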
As far as the kernel is concerned,
2.6.32-131.0.15.el6.exp10.bz16586.x86_64 was a stock RH 2.6.32-131.0.15
with an added patch for handling an MD Raid bug (RHBZ-707268). Stratus
does load a driver to track dirty VM pages for harvesting purposes, but
does not change general VM behavior.
FWIW, this is the only vmcore in which I've seen ELF note faults or
invalid section numbers.
Thanks,
-- Joe
-----Original Message-----
From: crash-utility-bounces(a)redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Wednesday, September 28, 2011 5:15 PM
To: Discussion list for crash utility usage,maintenance and development
Subject: Re: [Crash-utility] Crash faults when determining panic task
Hi Joe,
It's pretty clear it's due to this change in 5.1.5:
- Implemented the capability of using the NT_PRSTATUS ELF note data
  that is saved in version 4 compressed kdump headers to determine the
  starting stack and instruction pointer hooks for x86 and x86_64
  backtraces when they cannot be determined in the traditional manners.
  (wang.chao(a)cn.fujitsu.com, wency(a)cn.fujitsu.com)
What happens if you run it like so:
$ crash --no_elf_notes vmlinux vmcore
As far as this message:
WARNING: sparsemem: invalid section number: 137438888923
That should be outside the realm of Fujitsu's ELF notes patch. Does
this kernel have some kind of Stratus VM modification?
Dave
----- Original Message -----
Crash faults when determining panic task
I have a vmcore generated on RHEL6.1 that newer versions of crash
have trouble analyzing (5.1.1-2.el6 seems to work ok).
I can provide additional binary files if needed; just let me know
what convention best suits the list (ftp, private email attachment,
etc.).
Crash Version:      OS:             Result:
crash 5.1.8         Debian wheezy   faults
crash 5.1.7-1.el6   RHEL6.2 Alpha   faults
crash 5.1.1-2.el6   RHEL6.1         ok
Kernel:
2.6.32-131.0.15.el6.exp10.bz16586.x86_64 (2.6.32-131.0.15 + a fix
for Red Hat BZ 707268)
Interesting warnings when starting crash:
WARNING: sparsemem: invalid section number: 137438888923
WARNING: sparsemem: invalid section number: 137438888923
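For a sense of scale, here is a small sketch of why that section number
cannot be valid. It assumes the usual x86_64 sparsemem value
SECTION_SIZE_BITS = 27 (128MB sections) and takes MAX_PHYSMEM_BITS as a
parameter, since the value for this particular kernel is not confirmed
here. The function names are made up for illustration and are not crash
functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86_64 sparsemem uses 2^27-byte (128MB) sections. */
#define ASSUMED_SECTION_SIZE_BITS 27

/* Number of sections that fit in a 2^max_physmem_bits physical space. */
uint64_t max_section_count(unsigned max_physmem_bits)
{
    assert(max_physmem_bits > ASSUMED_SECTION_SIZE_BITS);
    assert(max_physmem_bits < 64);
    return (uint64_t)1 << (max_physmem_bits - ASSUMED_SECTION_SIZE_BITS);
}

/* Returns 1 if section number nr could exist, 0 if it is out of range. */
int section_nr_is_plausible(uint64_t nr, unsigned max_physmem_bits)
{
    return nr < max_section_count(max_physmem_bits);
}

/* Physical address at which section nr would start. */
uint64_t section_nr_to_phys_base(uint64_t nr)
{
    return nr << ASSUMED_SECTION_SIZE_BITS;
}
```

Assuming 46 physical address bits, there are at most 2^19 = 524288
sections. Section 137438888923 (0x1FFFFF03DB) would start at physical
address 0xFFFFF81ED8000000, far beyond any real physical memory, so the
value being converted to a section number is almost certainly not a
valid physical address.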
First fault, null pointer dereference:
please wait... (determining panic task)
Program received signal SIGSEGV, Segmentation fault.
x86_64_get_dumpfile_stack_frame (rsp=0x7fffffffcc58,
rip=0x7fffffffcc50,
bt_in=0x7fffffffcce0) at x86_64.c:4183
4183 ur_rip = ULONG(user_regs +
(gdb) p user_regs
$1 = 0x0
Workaround, check that bt->machdep is not NULL:
diff -Nupr crash-5.1.8/x86_64.c crash-5.1.8.new/x86_64.c
--- crash-5.1.8/x86_64.c 2011-09-16 15:01:12.000000000 -0400
+++ crash-5.1.8.new/x86_64.c 2011-09-28 14:12:45.347188571 -0400
@@ -4178,7 +4178,7 @@ x86_64_get_dumpfile_stack_frame(struct b
goto skip_stage;
}
}
- } else if (ELF_NOTES_VALID()) {
+ } else if (ELF_NOTES_VALID() && bt->machdep) {
user_regs = bt->machdep;
ur_rip = ULONG(user_regs +
OFFSET(user_regs_struct_rip));
Second fault, a curiously large n_descsz in the ELF note header:
please wait... (determining panic task)
Program received signal SIGSEGV, Segmentation fault.
get_regs_from_note (note=0xd26472 "\b", ip=0x7fffffffc4e0,
sp=0x7fffffffc4e8)
at netdump.c:2221
2221 *sp = ULONG(user_regs + offset_sp);
(gdb) p *(Elf64_Nhdr *)note
$1 = {n_namesz = 8, n_descsz = 3438804992, n_type = 8}
Workaround, do not attempt to read registers from ELF notes (this
chunk of code was not present in crash 5.1.1):
diff -Nupr crash-5.1.8/netdump.c crash-5.1.8.new/netdump.c
--- crash-5.1.8/netdump.c 2011-09-16 15:01:12.000000000 -0400
+++ crash-5.1.8.new/netdump.c 2011-09-28 14:14:43.687183734 -0400
@@ -2286,7 +2286,7 @@ get_netdump_regs_x86_64(struct bt_info *
bt->machdep = (void *)user_regs;
}
-
+#if 0
if (ELF_NOTES_VALID() &&
(bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
(note = (Elf64_Nhdr *)
@@ -2305,7 +2305,7 @@ get_netdump_regs_x86_64(struct bt_info *
bt->machdep = (void *)user_regs;
}
-
+#endif
machdep->get_stack_frame(bt, ripp, rspp);
}
Given the warning messages at the beginning of the process, I'm not sure
if I'm dealing with a corrupted or incomplete vmcore image. Let me
know what additional info could be useful if this seems worth
debugging further.
Thanks,
-- Joe Lawrence
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility