Question for LKCD maintainers
by Dave Anderson
Long after I stopped tinkering with the LKCD code in crash,
changes were contributed to support physical memory zones
in the LKCD dumpfile format. Specifically there is this
piece of save_offset() in lkcd_common.c:
    /* find the zone */
    for (ii=0; ii < lkcd->num_zones; ii++) {
        if (lkcd->zones[ii].start == zone) {
            if (lkcd->zones[ii].pages[page].offset != 0) {
                if (lkcd->zones[ii].pages[page].offset != off) {
                    error(INFO, "conflicting page: zone %lld, "
                        "page %lld: %lld, %lld != %lld\n",
                        (unsigned long long)zone,
                        (unsigned long long)page,
                        (unsigned long long)paddr,
                        (unsigned long long)off,
                        (unsigned long long) \
                        lkcd->zones[ii].pages[page].offset);
                    abort();
                }
                ret = 0;
            } else {
                lkcd->zones[ii].pages[page].offset = off;
                ret = 1;
            }
            break;
        }
    }
The call to abort() above kills the crash session, which is both
annoying and unnecessary.
I am seeing it with a dumpfile from a customer who has their own dumping
scheme based upon LKCD version 7. I understand that this may be a
problem with their LKCD port, but nonetheless, it's the only place in
the crash utility that doesn't recover gracefully from dumpfile access
errors.
Anyway, I would like to either:
1. change the error(INFO...) to error(FATAL...) so that run-time
commands encountering this error will just fail, and the session
will return to the crash> prompt, or
2. return 0, so that a "seek error" can be subsequently displayed
by the readmem() command.
Number 2 is preferable, because it yields more clues as to where the
readmem() came from, but since I don't know much about the LKCD
physical memory zones stuff, is there any reason that shouldn't
be done?
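A minimal sketch of option 2, assuming simplified stand-in types rather than the real LKCD structures: on a conflicting offset the helper reports the problem and returns 0 instead of calling abort(), leaving it to the caller to fail the individual read. The names page_slot and record_page_offset are hypothetical and are not taken from lkcd_common.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one entry of lkcd->zones[ii].pages[]. */
struct page_slot {
    uint64_t offset;    /* 0 means "no dumpfile offset recorded yet" */
};

/*
 * Record the dumpfile offset of a page.
 * Returns 1 if the offset was newly recorded, 0 otherwise.
 * A conflicting offset is reported but is not fatal: the stored
 * value is left alone and 0 is returned, so the caller can fail
 * the individual read instead of the whole session.
 */
static int record_page_offset(struct page_slot *slot, uint64_t off)
{
    if (slot->offset == 0) {
        slot->offset = off;
        return 1;
    }
    if (slot->offset != off)
        fprintf(stderr, "conflicting page: %llu != %llu\n",
                (unsigned long long)off,
                (unsigned long long)slot->offset);
    return 0;
}
```

Whether returning 0 here is safe for the zone bookkeeping is exactly the open question above; the sketch only shows the control flow.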
Thanks,
Dave
16 years, 10 months
[PATCH] Improve error handling when architecture doesn't match
by Bernhard Walle
Currently, crash always prints
crash: vmcore: not a supported file format
if you try to open a dump file which is not supported. However, this can be
misleading if you have a valid ELF core dump but are using a crash binary
built for the wrong architecture. In the case I observed, the user had an
ELF64 x86 dump file and assumed it was x86-64. However, it was actually an
i386 core dump that was ELF64 because kexec was called with
--elf64-core-headers, which makes sense if the i386 machine has PAE and
possibly more than 4 GiB of physical RAM.
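The distinction between ELF class and machine type can be sketched as follows. This is an illustration only, not code from the patch: summarize_elf and machine_name are hypothetical helpers, and the numeric constants are the ELF specification values written out so the example does not depend on <elf.h>.

```c
#include <stddef.h>

/* Values from the ELF specification, repeated so no <elf.h> is needed. */
enum { IDX_CLASS = 4, IDX_DATA = 5, OFF_MACHINE = 18 };
enum { CLASS_64 = 2, DATA_MSB = 2 };
enum { MACH_386 = 3, MACH_PPC64 = 21, MACH_IA_64 = 50, MACH_X86_64 = 62 };

struct elf_summary {
    int is_elf;         /* 1 if the magic bytes matched */
    int bits;           /* 32 or 64, from e_ident[EI_CLASS] */
    int big_endian;     /* from e_ident[EI_DATA] */
    unsigned machine;   /* e_machine, decoded in the file's byte order */
};

/* Summarize the first 20 bytes of an ELF header. */
static struct elf_summary summarize_elf(const unsigned char *hdr, size_t len)
{
    struct elf_summary s = { 0, 0, 0, 0 };

    if (len < 20 || hdr[0] != 0x7f || hdr[1] != 'E' ||
        hdr[2] != 'L' || hdr[3] != 'F')
        return s;
    s.is_elf = 1;
    s.bits = (hdr[IDX_CLASS] == CLASS_64) ? 64 : 32;
    s.big_endian = (hdr[IDX_DATA] == DATA_MSB);
    /* e_machine is a 16-bit field at byte offset 18 in ELF32 and ELF64. */
    if (s.big_endian)
        s.machine = ((unsigned)hdr[OFF_MACHINE] << 8) | hdr[OFF_MACHINE + 1];
    else
        s.machine = hdr[OFF_MACHINE] | ((unsigned)hdr[OFF_MACHINE + 1] << 8);
    return s;
}

/* Map e_machine to the names crash uses for MACHINE_TYPE. */
static const char *machine_name(unsigned machine)
{
    switch (machine) {
    case MACH_386:    return "X86";
    case MACH_PPC64:  return "PPC64";
    case MACH_IA_64:  return "IA64";
    case MACH_X86_64: return "X86_64";
    default:          return "unknown";
    }
}
```

A header with class ELF64 and e_machine 3 is summarized as a 64-bit container for an "X86" dump, which is the situation described above.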
After that patch is applied, an example output is
Looks like a valid ELF dump, but host architecture (X86_64) \
doesn't match dump architecture (IA64).
or if I try to open a PPC64 dump on x86-64:
Looks like a valid ELF dump, but host endianess (LE) \
doesn't match target endianess (BE)
Please review and consider applying.
Signed-off-by: Bernhard Walle <bwalle(a)suse.de>
---
defs.h | 3 ++-
netdump.c | 48 +++++++++++++++++++++++++++++++++++++++++++-----
tools.c | 9 ++++++++-
3 files changed, 53 insertions(+), 7 deletions(-)
--- a/defs.h
+++ b/defs.h
@@ -3198,7 +3198,8 @@ void stall(ulong);
char *pages_to_size(ulong, char *);
int clean_arg(void);
int empty_list(ulong);
-int machine_type(char *);
+int machine_type(const char *);
+int is_big_endian(void);
void command_not_supported(void);
void option_not_supported(int);
void please_wait(char *);
--- a/netdump.c
+++ b/netdump.c
@@ -36,6 +36,32 @@ static void check_dumpfile_size(char *);
#define ELFREAD 0
#define MIN_PAGE_SIZE (4096)
+
+
+/*
+ * Checks if the machine type of the host matches required_type.
+ * If not, it prints a short error message for the user.
+ */
+static int machine_type_error(const char *required_type)
+{
+ if (machine_type(required_type))
+ return 1;
+ else {
+ fprintf(stderr, "Looks like a valid ELF dump, but host "
+ "architecture (%s) doesn't match dump "
+ "architecture (%s).\n",
+ MACHINE_TYPE, required_type);
+ return 0;
+ }
+}
+
+/*
+ * Returns endianess in a string
+ */
+static const char *endianess_to_string(int big_endian)
+{
+ return big_endian ? "BE" : "LE";
+}
/*
* Determine whether a file is a netdump/diskdump/kdump creation,
@@ -98,6 +124,18 @@ is_netdump(char *file, ulong source_quer
* If either kdump difference is seen, presume kdump -- this
* is obviously subject to change.
*/
+
+ /* check endianess */
+ if ((STRNEQ(elf32->e_ident, ELFMAG) || STRNEQ(elf64->e_ident, ELFMAG)) &&
+ (elf32->e_type == ET_CORE || elf64->e_type == ET_CORE) &&
+ ((elf32->e_ident[EI_DATA] == ELFDATA2LSB && is_big_endian()) ||
+ (elf32->e_ident[EI_DATA] == ELFDATA2MSB && !is_big_endian())))
+ fprintf(stderr, "Looks like a valid ELF dump, but host "
+ "endianess (%s) doesn't match target "
+ "endianess (%s)\n",
+ endianess_to_string(is_big_endian()),
+ endianess_to_string(elf32->e_ident[EI_DATA] == ELFDATA2MSB));
+
if (STRNEQ(elf32->e_ident, ELFMAG) &&
(elf32->e_ident[EI_CLASS] == ELFCLASS32) &&
(elf32->e_ident[EI_DATA] == ELFDATA2LSB) &&
@@ -108,7 +146,7 @@ is_netdump(char *file, ulong source_quer
switch (elf32->e_machine)
{
case EM_386:
- if (machine_type("X86"))
+ if (machine_type_error("X86"))
break;
default:
goto bailout;
@@ -133,28 +171,28 @@ is_netdump(char *file, ulong source_quer
{
case EM_IA_64:
if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
- machine_type("IA64"))
+ machine_type_error("IA64"))
break;
else
goto bailout;
case EM_PPC64:
if ((elf64->e_ident[EI_DATA] == ELFDATA2MSB) &&
- machine_type("PPC64"))
+ machine_type_error("PPC64"))
break;
else
goto bailout;
case EM_X86_64:
if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
- machine_type("X86_64"))
+ machine_type_error("X86_64"))
break;
else
goto bailout;
case EM_386:
if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
- machine_type("X86"))
+ machine_type_error("X86"))
break;
else
goto bailout;
--- a/tools.c
+++ b/tools.c
@@ -4518,11 +4518,18 @@ empty_list(ulong list_head_addr)
}
int
-machine_type(char *type)
+machine_type(const char *type)
{
return STREQ(MACHINE_TYPE, type);
}
+int
+is_big_endian(void)
+{
+ unsigned short value = 0xff;
+ return *((unsigned char *)&value) != 0xff;
+}
+
void
command_not_supported()
{
16 years, 11 months
x86 backtrace is dependent upon struct pt_regs at compile time
by Alan Tyson
This problem has been reported before, but the discussion on it seemed
to move off track and I don't think that anyone really found the root cause.
The problem is that the x86 backtrace functionality in crash is
dependent upon the struct pt_regs taken from <asm/ptrace.h> at compile
time. struct pt_regs changed in 2.6.20. The result of this is that if
crash is compiled on 2.6.20 or later and subsequently used to look at a
2.6.19 or earlier dump, then exception frames are incorrectly displayed
and backtraces stop at them.
Here is an example of a 2.6.22-compiled crash displaying a trace from a
RHEL5 (2.6.18) dump:
crash> bt
PID: 3490 TASK: f7f5a000 CPU: 0 COMMAND: "insmod"
#0 [f664ddd0] crash_kexec at c0441c78
#1 [f664de14] die at c04064a4
#2 [f664de44] do_page_fault at c0605eea
#3 [f664de94] error_code (via page_fault) at c0405a6f
EAX: 00000000 EBX: f8dd3400 ECX: 00200082 EDX: 00200000
DS: 007b ESI: f7bbeab0 ES: 007b EDI: f7bbe800
SS: ffffe800 ESP: 00000000 EBP: f7bbead8
CS: 0060 EIP: f8dd300d ERR: ffffffff EFLAGS: 00210296
crash>
Note that in the above, crash thinks that the exception frame is a user
mode one and not a kernel frame.
If crash was compiled on RHEL5 (2.6.18), then the trace looks like this:
crash> bt
PID: 3490 TASK: f7f5a000 CPU: 0 COMMAND: "insmod"
#0 [f664ddd0] crash_kexec at c0441c78
#1 [f664de14] die at c04064a4
#2 [f664de44] do_page_fault at c0605eea
#3 [f664de94] error_code (via page_fault) at c0405a6f
EAX: 00000000 EBX: f8dd3400 ECX: 00200082 EDX: 00200000 EBP: f7bbead8
DS: 007b ESI: f7bbeab0 ES: 007b EDI: f7bbe800
CS: 0060 EIP: f8dd300d ERR: ffffffff EFLAGS: 00210296
#4 [f664dec8] function2 at f8dd300d
#5 [f664dee0] sys_init_module at c043e717
#6 [f664dfb8] system_call at c0404ef8
EAX: ffffffda EBX: 0861a028 ECX: 00010144 EDX: 0861a018
DS: 007b ESI: 00000000 ES: 007b EDI: 00307ff4
SS: 007b ESP: bfe5695c EBP: bfe569a8
CS: 0073 EIP: 00d37402 ERR: 00000080 EFLAGS: 00200206
crash>
A similar problem happens if crash is compiled on pre-2.6.20 and then
used to analyse a 2.6.20 or later dump.
Dave, I have attached a patch to this e-mail which removes the
dependence upon <asm/ptrace.h> from lkcd_x86_trace.c (which is used for
non-LKCD dumps as well as LKCD dumps by the way). I notice that
eframe_init() in x86.c initialises several variables which correspond to
the struct pt_regs so I've had to make these external for
lkcd_x86_trace.c's use. I have no problem in this being reworked if you
feel that these symbols really should be in defs.h (or any other rework
that you think is fit, for that matter).
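The general idea of replacing a compile-time struct with offsets chosen at run time can be sketched like this. It is not the attached patch and not a reproduction of the exact failure shown above: the two layouts are illustrative assumptions (layout B simply has one extra 32-bit slot ahead of orig_eax), and eframe_offsets_for and eframe_reg are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layouts only; not copied from any kernel header. */
struct regs_layout_a {
    uint32_t ebx, ecx, edx, esi, edi, ebp, eax;
    uint32_t xds, xes, orig_eax, eip, xcs, eflags, esp, xss;
};
struct regs_layout_b {
    uint32_t ebx, ecx, edx, esi, edi, ebp, eax;
    uint32_t xds, xes, extra_seg, orig_eax, eip, xcs, eflags, esp, xss;
};

/* Offsets the backtrace code needs, selected per dump at run time. */
struct eframe_offsets {
    size_t eip, xcs, esp, xss;
};

static struct eframe_offsets eframe_offsets_for(int dump_uses_layout_b)
{
    struct eframe_offsets o;

    if (dump_uses_layout_b) {
        o.eip = offsetof(struct regs_layout_b, eip);
        o.xcs = offsetof(struct regs_layout_b, xcs);
        o.esp = offsetof(struct regs_layout_b, esp);
        o.xss = offsetof(struct regs_layout_b, xss);
    } else {
        o.eip = offsetof(struct regs_layout_a, eip);
        o.xcs = offsetof(struct regs_layout_a, xcs);
        o.esp = offsetof(struct regs_layout_a, esp);
        o.xss = offsetof(struct regs_layout_a, xss);
    }
    return o;
}

/* Read one 32-bit register from a raw exception frame copied from a dump. */
static uint32_t eframe_reg(const void *frame, size_t offset)
{
    uint32_t value;

    memcpy(&value, (const unsigned char *)frame + offset, sizeof(value));
    return value;
}
```

Reading a layout-A frame with layout-B offsets returns the neighbouring register (xcs where eip was expected), which is the kind of mismatch a fixed compile-time struct cannot avoid when the binary and the dump disagree.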
Regards,
Alan Tyson, HP.
16 years, 11 months
Re: [Crash-utility] problems running crash on recent rawhide live kernels
by Dave Anderson
> Jeff Layton wrote:
> > Relevant packages:
> >
> > kernel-2.6.24-0.62.rc3.git5.fc9.x86_64
> > kernel-debuginfo-2.6.24-0.62.rc3.git5.fc9.x86_64
> > crash-4.0-4.10.x86_64
> >
> > ... the host is a FV xen guest (but that shouldn't matter, should
> > it?).
To get crash version 4.0-4.11 to run against that particular
dumpfile, it needs to know the kernel's "phys_base" relocation
value. And I don't know how (or if it's even possible) to get
it from a fully-virtualized Xen guest dumpfile. However, if
you run crash live on the kernel that panicked, you can
determine it. So running live on kernel-2.6.24-0.62.rc3.git5.fc9
I see:
crash> help -m | grep phys_base
phys_base: ffffffffff200000
crash>
...which in turn can be used as a command line argument for the
xendump dumpfile from that kernel. So taking the sample dumpfile
you gave me:
# crash --machdep phys_base=0xffffffffff200000 vmlinux vmcore-rawhide.xmdump
crash 4.0-4.11
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
NOTE: setting phys_base to: 0xffffffffff200000
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
KERNEL: vmlinux
DUMPFILE: vmcore-rawhide.xmdump
CPUS: 1
DATE: Tue Dec 4 15:41:08 2007
UPTIME: 06:10:51
LOAD AVERAGE: 0.00, 0.00, 0.00
TASKS: 74
NODENAME: dhcp231-229.rdu.redhat.com
RELEASE: 2.6.24-0.62.rc3.git5.fc9
VERSION: #1 SMP Sat Dec 1 13:59:08 EST 2007
MACHINE: x86_64 (3458 Mhz)
MEMORY: 511.6 MB
PANIC: "SysRq : Trigger a crashdump"
PID: 0
COMMAND: "swapper"
TASK: ffffffff813a1780 [THREAD_INFO: ffffffff81496000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
crash>
Pain in the ass. But I don't know any better way.
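For readers wondering why phys_base looks like such a large number, here is a small sketch. It assumes the usual x86_64 relation for kernel text addresses, physical = virtual - __START_KERNEL_map + phys_base, with __START_KERNEL_map at 0xffffffff80000000; kernel_text_vtop is a hypothetical helper, not a crash function.

```c
#include <stdint.h>

/* Assumed base of the x86_64 kernel text mapping (__START_KERNEL_map). */
#define START_KERNEL_MAP 0xffffffff80000000ULL

/*
 * Translate a kernel text virtual address to a physical address.
 * phys_base is effectively a signed relocation: in unsigned 64-bit
 * arithmetic 0xffffffffff200000 behaves as -0xe00000.
 */
static uint64_t kernel_text_vtop(uint64_t vaddr, uint64_t phys_base)
{
    return vaddr - START_KERNEL_MAP + phys_base;
}
```

Under that assumption, the task address 0xffffffff813a1780 from the session above, combined with phys_base 0xffffffffff200000, works out to physical address 0x5a1780.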
Dave
16 years, 11 months
crash version 4.0-4.12 is available
by Dave Anderson
- Fix for the "kmem -n" command to handle the 2.6.24 kernel replacement
of the "node_online_map" nodemask with its appropriate entry in the
new "node_states[]" nodemask array. Without the patch, the per-node
zone data would not be displayed, and any commands depending upon
the node table data would be affected. (anderson(a)redhat.com)
- Fix for "kmem -p" on 2.6.24 x86_64 kernels that are configured with
CONFIG_SPARSEMEM_VMEMMAP, which use a virtually-mapped page struct
array. Without the patch, the virtual-to-physical translation of
each page structure was invalid, and "kmem -p" would display invalid
data. This would also affect other commands as well, such as the
output of "kmem -i", and the output of a "vtop" command on a mapped
page address. Also, the virtual base address of the region is now
displayed by the "mach" command.
(oomichi(a)mxs.nes.nec.co.jp, anderson(a)redhat.com)
- Fix for the "dev" command's character device name string output to
recognize the change of the name structure member from a pointer
to an embedded string. Without the patch, 2.6.16 and later kernels
would display "(unknown)" character device names.
(olivier.daudel(a)u-paris10.fr, anderson(a)redhat.com)
- Fix for the "kmem -[sS]" command to handle the 2.6.24 change to
the CONFIG_SLUB kmem_cache structure, which re-worked the manner
in which the per-cpu slabs get referenced. Without the patch,
the command would fail with several error messages of the type:
"kmem: page_to_nid: invalid page: ffff81003993f4b0".
(anderson(a)redhat.com)
- Fix for the "kmem -[fF]" command to handle the 2.6.24 kernel change
of the free_area struct, which replaced the singular linked list
of pages with 5 (MIGRATE_TYPES) linked lists. Without the patch,
the command would fail with the error message: "kmem: unrecognized
free_area struct size: 88". (anderson(a)redhat.com)
- Fix for the "runq" command to handle the 2.6.24 kernel change to
the CFS scheduler that introduced per-cpu init_cfs_rq structures
for task group scheduling. Without the patch, no queued tasks
were displayed, because the rb_root of queued tasks was being
taken from the embedded cfs_rq in each per-cpu runqueue.
(anderson(a)redhat.com)
Download from: http://people.redhat.com/anderson
16 years, 11 months
Patch for command dev
by Olivier Daudel
Hello Dave,
A small patch for dev.c.
If I am correct, as of 2.6.16 the name member in chrdevs becomes an array.
crash> dev
CHRDEV NAME OPERATIONS
1 (unknown) (none)
4 (unknown) (none)
4 (unknown) (none)
4 (unknown) (none)
5 (unknown) (none)
With the patch :
crash> dev
CHRDEV NAME OPERATIONS
1 mem (none)
4 /dev/vc/0 (none)
4 tty (none)
4 ttyS (none)
5 /dev/tty (none)
--- crash-4.0-4.11/dev.c 2007-12-06 16:47:06.000000000 +0100
+++ crash-4.0-4.11-change/dev.c 2007-12-10 17:13:30.000000000 +0100
@@ -202,7 +202,9 @@
name = ULONG(char_device_struct_buf +
OFFSET(char_device_struct_name));
if (name) {
- if (!read_string(name, buf, BUFSIZE-1))
+ if (THIS_KERNEL_VERSION >= LINUX(2,6,16))
+ sprintf(buf,char_device_struct_buf+OFFSET(char_device_struct_name));
+ else if (!read_string(name, buf, BUFSIZE-1))
sprintf(buf, "(unknown)");
} else
sprintf(buf, "(unknown)");
@@ -244,7 +246,9 @@
name = ULONG(char_device_struct_buf +
OFFSET(char_device_struct_name));
if (name) {
- if (!read_string(name, buf, BUFSIZE-1))
+ if (THIS_KERNEL_VERSION >= LINUX(2,6,16))
+ sprintf(buf,char_device_struct_buf+OFFSET(char_device_struct_name));
+ else if (!read_string(name, buf, BUFSIZE-1))
sprintf(buf, "(unknown)");
} else
sprintf(buf, "(unknown)");
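One caveat when copying an embedded name: passing dumpfile bytes as the format argument of sprintf() means any '%' in the name would be interpreted. A possible alternative, sketched with a hypothetical helper copy_embedded_name (not part of dev.c), copies the field through "%.*s" so it is bounded and treated as plain data. The field length is a parameter because the real array size is an assumption here.

```c
#include <stdio.h>

/*
 * Copy an embedded, possibly unterminated name field into buf.
 * The precision in "%.*s" limits how many bytes of field are read,
 * and the field is never used as a format string.
 * An empty name falls back to "(unknown)".
 */
static void copy_embedded_name(char *buf, size_t buflen,
                               const char *field, size_t fieldlen)
{
    int n = snprintf(buf, buflen, "%.*s", (int)fieldlen, field);

    if (n <= 0)
        snprintf(buf, buflen, "(unknown)");
}
```

In the context of the patch, field would correspond to char_device_struct_buf + OFFSET(char_device_struct_name).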
16 years, 11 months
Right way to display contents of memory[crash on ia64]
by Dheeraj Sangamkar
Hi,
I am using crash 4.0-2.30 on an ia64 machine.
The memory dump of the stack shows parameters on the stack, one of which is
a user space pointer.
e00000014c930ed8: __gp v+4643276848
e00000014c930ee8: 60000fffffffb390 00000000000000ff
e00000014c930ef8: v+4643276864 v+5579701608
e00000014c930f08: sys_readlink+480 0000000000000792
OR
e00000014c930ed8: a0000001009bb820 e000000114c2c830 .......0.......
e00000014c930ee8: 60000fffffffb390 00000000000000ff .......`........
e00000014c930ef8: e000000114c2c840 e00000014c937d68 @.......h}.L....
e00000014c930f08: a00000010013da60 0000000000000792 `...............
I want to find what the parameter v+4643276848/e000000114c2c830 points to.
I used rd to print this, but I don't see what I expect. (I used "rd
e000000114c2c830 10".)
What's the right way to inspect that memory?
Dheeraj
16 years, 11 months
Heads up: crash command errors with 2.6.24 kernels
by Dave Anderson
It should be noted that while version 4.0-4.11 will at least allow
a crash session to initialize, there are several other 2.6.24-related
kernel changes that have broken several key commands. Among them, at
least on x86_64 kernels:
1. "kmem -[sS]" fails due to changes in the CONFIG_SLUB code between
2.6.22 and 2.6.24.
2. "kmem <address>" doesn't work at all.
3. "kmem -n" fails to show any pgdat-node related information.
4. "kmem -f" doesn't work at all.
5. "kmem -i" doesn't work at all.
6. "runq" for the CFS scheduler no longer shows any queued tasks,
but only the relevant structure addresses.
7. The kernel's use of a virtual mem_map array on x86_64 is not
handled, and this may lead to other page struct related errors.
Dave
16 years, 11 months
crash version 4.0-4.11 is available
by Dave Anderson
- Fix for task-gathering to handle the 2.6.24 pid_namespace-related
changes to the kernel pid_hash array. Without the patch, the crash
session fails during initialization with the message "crash: cannot
gather a stable task list via pid_hash (500 retries)".
(anderson(a)redhat.com)
- Fix for "kmem -f <address>" and "kmem <address>" commands on
x86 kernels, which may incorrectly indicate that the address is in
the kernel's free page list. Without this patch, if the address
argument is a physical address over 4GB, or a page struct address
referencing a physical address over 4GB, it is possible that the
address would incorrectly be shown as being in the kernel's free
page list. (anderson(a)redhat.com)
- Fix for x86 "bt" command for active tasks in Egenera dumpfiles
based upon LKCD version 7. Without the patch, the starting points
for the active task backtraces were erroneous.
(anderson(a)redhat.com)
- Fix for a potential segmentation violation during crash session
initialization if a task's kernel stack has been completely overrun,
corrupting its thread_info structure at the bottom of the stack.
This could occur running against kernels from 2.6.8 through 2.6.18.
With the patch, the suspect task will be reported during the task
initialization sequence. (anderson(a)redhat.com)
- Fix for "kmem -S" error message if a slab object is found in both
a per-cpu list and on a slab's global free list. Without the patch,
the object address and cpu number values are flip-flopped in the
error message. (bob.montgomery(a)hp.com)
Download from: http://people.redhat.com/anderson
16 years, 11 months
typo affects kmem -S error output
by Bob Montgomery
Dave,
This patch fixes a typo in memory.c.
Before:
=======
crash> kmem -S sctp_bind_bucket
...
kmem: "sctp_bind_bucket" cache: object 0 on both free and cpu 651223584 lists
...
(Note cpu number)
After:
======
crash> kmem -S sctp_bind_bucket
...
kmem: "sctp_bind_bucket" cache: object ffff810126d0e220 on both free and cpu 0 lists
...
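Note that 651223584 is 0x26d0e220, the low 32 bits of the object address in the "After" output, which is consistent with the object and cpu arguments having been passed in the wrong order. A minimal sketch of the corrected ordering follows; format_dup_object is a hypothetical helper, not the code in memory.c.

```c
#include <stdio.h>

/*
 * Format the duplicate-object message with the arguments in the same
 * order as the conversions: cache name, object address, cpu number.
 */
static int format_dup_object(char *buf, size_t buflen, const char *cache,
                             unsigned long long obj, int cpu)
{
    return snprintf(buf, buflen,
        "\"%s\" cache: object %llx on both free and cpu %d lists",
        cache, obj, cpu);
}
```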
Bob Montgomery
Working at HP in Fort Collins
16 years, 12 months