I ran a test on ~200 dumpfiles, and for the most part, the patch is
quite useful in replacing the "Oops" message with something more
helpful.
However, the "[Hardware Error]" check should be the very last thing checked.
Actually, I'm not even sure whether it should be checked at all, because there
are dozens of pr_emerg(HW_ERR ...) calls in the kernel, and it appears that they
don't all necessarily cause the kernel to crash.
For example, this sample vmcore currently correctly shows that the kernel has
crashed due to a BUG in mm/slab.c:
crash> sys
KERNEL: 2.6.32-220.el6.x86_64_slab_page_corruption/vmlinux.gz
DUMPFILE: 2.6.32-220.el6.x86_64_slab_page_corruption/musa_vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Thu Feb 14 09:14:12 2013
UPTIME: 14:18:49
LOAD AVERAGE: 2.23, 1.94, 2.04
TASKS: 1621
NODENAME: musa
RELEASE: 2.6.32-220.el6.x86_64
VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
MACHINE: x86_64 (2599 Mhz)
MEMORY: 128 GB
PANIC: "kernel BUG at mm/slab.c:533!"
crash> bt
PID: 159 TASK: ffff881018c2eac0 CPU: 28 COMMAND: "events/28"
#0 [ffff881018c359f0] machine_kexec at ffffffff81031fcb
#1 [ffff881018c35a50] crash_kexec at ffffffff810b8f72
#2 [ffff881018c35b20] oops_end at ffffffff814f04b0
#3 [ffff881018c35b50] die at ffffffff8100f26b
#4 [ffff881018c35b80] do_trap at ffffffff814efda4
#5 [ffff881018c35be0] do_invalid_op at ffffffff8100ce35
#6 [ffff881018c35c80] invalid_op at ffffffff8100bedb
[exception RIP: free_block+357]
RIP: ffffffff8115ffd5 RSP: ffff881018c35d30 RFLAGS: 00010006
RAX: ffffea00321db658 RBX: ffff880f5bc52c80 RCX: 0000000000000002
RDX: 004000000000006c RSI: ffff880fb58e9ac0 RDI: ffff880e51a1d000
RBP: ffff881018c35d80 R8: ffff880fb58e9ac0 R9: 0000000000000000
R10: 000000000000000c R11: 0000000000000000 R12: 0000000000000006
R13: ffff880ffaa95828 R14: 0000000000000002 R15: ffffea0000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff881018c35d88] drain_array at ffffffff81160211
#8 [ffff881018c35dd8] cache_reap at ffffffff81161210
#9 [ffff881018c35e38] worker_thread at ffffffff8108b2b0
#10 [ffff881018c35ee8] kthread at ffffffff81090886
#11 [ffff881018c35f48] kernel_thread at ffffffff8100c14a
crash>
With your patch applied, it incorrectly shows this:
crash> sys
KERNEL: 2.6.32-220.el6.x86_64_slab_page_corruption/vmlinux.gz
DUMPFILE: 2.6.32-220.el6.x86_64_slab_page_corruption/musa_vmcore [PARTIAL DUMP]
CPUS: 32
DATE: Thu Feb 14 09:14:12 2013
UPTIME: 14:18:49
LOAD AVERAGE: 2.23, 1.94, 2.04
TASKS: 1621
NODENAME: musa
RELEASE: 2.6.32-220.el6.x86_64
VERSION: #1 SMP Wed Nov 9 08:03:13 EST 2011
MACHINE: x86_64 (2599 Mhz)
MEMORY: 128 GB
PANIC: "[Hardware Error]: Machine check events logged"
I don't have a problem with the other parts of the patch.
I'll move the hardware error check to the bottom, and only use it if there
are no other relevant strings found, and then re-test that configuration.
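
A minimal sketch of that ordering, not the actual task.c code: the log
is modeled as an array of strings rather than pc->tmpfile, and the names
scan_pass() and pick_panic_line() are hypothetical. Each pass stands in
for one rewind()/fgets() loop in get_panicmsg(), and the
"[Hardware Error]" pass runs only if every other pass came up empty.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPATS(a) (sizeof(a) / sizeof((a)[0]))

/* One scan over the log: return the first line containing any pattern. */
static const char *scan_pass(const char *const *log, size_t nlines,
                             const char *const *pats, size_t npats)
{
    for (size_t i = 0; i < nlines; i++)
        for (size_t j = 0; j < npats; j++)
            if (strstr(log[i], pats[j]))
                return log[i];
    return NULL;
}

/*
 * Hypothetical stand-in for the scanning loops in get_panicmsg().
 * A later pass runs only when all earlier passes found nothing.
 */
const char *pick_panic_line(const char *const *log, size_t nlines)
{
    static const char *const gpf[] = { "general protection fault" };
    static const char *const sysrq[] = {
        "SysRq : Netdump", "SysRq : Trigger a crashdump",
        "SysRq : Crash", "SysRq : Trigger a crash",
    };
    static const char *const oops[] = {
        "Oops: ", "kernel BUG at", "BUG: unable to handle kernel ",
    };
    /* Last resort: HW_ERR lines do not always mean the kernel crashed. */
    static const char *const hw_err[] = { "[Hardware Error]: " };
    const char *hit;

    assert(log != NULL || nlines == 0);

    if ((hit = scan_pass(log, nlines, gpf, NPATS(gpf))))
        return hit;
    if ((hit = scan_pass(log, nlines, sysrq, NPATS(sysrq))))
        return hit;
    if ((hit = scan_pass(log, nlines, oops, NPATS(oops))))
        return hit;
    return scan_pass(log, nlines, hw_err, NPATS(hw_err));
}
```

With a log that contains both the machine check line and the slab BUG
line, as in the vmcore above, this ordering selects the "kernel BUG at"
line. A log with only the machine check line still reports it.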
Dave
----- Original Message -----
Too many kinds of panic are categorized under the same "Oops: xxxx"
string, which makes this field ambiguous and not very useful:
PANIC: "Oops: 0000 [#1] SMP " (check log for details)
This patch separates out 3 kinds of panicmsg, the most common cases
among the machines I manage. The match strings are copied exactly from
the kernel source code. With the patch applied, I get panicmsg like:
include/linux/kernel.h:#define HW_ERR
panicmsg: "[Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 11: f200003f000100b2"
drivers/char/sysrq.c:__handle_sysrq
panicmsg: "SysRq : Trigger a crash"
arch/x86/kernel/traps.c:do_general_protection
panicmsg: "general protection fault: 8800 [#1] SMP"
arch/x86/mm/fault.c:show_fault_oops
panicmsg: "BUG: unable to handle kernel paging request at 00001248a68eb328"
The SysRq matching also needs to move ahead of the "Oops" matching: the log
of a SysRq-triggered crash usually contains an "Oops" line as well, so the
SysRq match has to take precedence.
Signed-off-by: Derek Che <drc(a)yahoo-inc.com>
---
task.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/task.c b/task.c
index 4214d7f..1530e7b 100644
--- a/task.c
+++ b/task.c
@@ -5509,19 +5509,31 @@ get_panicmsg(char *buf)
}
rewind(pc->tmpfile);
while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
- if (strstr(buf, "Oops: ") ||
- strstr(buf, "kernel BUG at"))
- msg_found = TRUE;
+ if (strstr(buf, "[Hardware Error]: "))
+ msg_found = TRUE;
+ }
+ rewind(pc->tmpfile);
+ while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
+ if (strstr(buf, "general protection fault"))
+ msg_found = TRUE;
}
rewind(pc->tmpfile);
while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
if (strstr(buf, "SysRq : Netdump") ||
strstr(buf, "SysRq : Trigger a crashdump") ||
- strstr(buf, "SysRq : Crash")) {
+ strstr(buf, "SysRq : Crash") ||
+ strstr(buf, "SysRq : Trigger a crash")) {
pc->flags |= SYSRQ;
msg_found = TRUE;
}
}
+ rewind(pc->tmpfile);
+ while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
+ if (strstr(buf, "Oops: ") ||
+ strstr(buf, "kernel BUG at") ||
+ strstr(buf, "BUG: unable to handle kernel "))
+ msg_found = TRUE;
+ }
rewind(pc->tmpfile);
while (!msg_found && fgets(buf, BUFSIZE, pc->tmpfile)) {
if (strstr(buf, "sysrq") &&