December 2007 - Crash-utility - Crash Utility List Archives

by Dave Anderson

Long after I stopped tinkering with the LKCD code in crash, changes were contributed to support physical memory zones in the LKCD dumpfile format. Specifically, there is this piece of save_offset() in lkcd_common.c:

	/* find the zone */
	for (ii=0; ii < lkcd->num_zones; ii++) {
		if (lkcd->zones[ii].start == zone) {
			if (lkcd->zones[ii].pages[page].offset != 0) {
				if (lkcd->zones[ii].pages[page].offset != off) {
					error(INFO, "conflicting page: zone %lld, "
						"page %lld: %lld, %lld != %lld\n",
						(unsigned long long)zone,
						(unsigned long long)page,
						(unsigned long long)paddr,
						(unsigned long long)off,
						(unsigned long long) \
						lkcd->zones[ii].pages[page].offset);
					abort();
				}
				ret = 0;
			} else {
				lkcd->zones[ii].pages[page].offset = off;
				ret = 1;
			}
			break;
		}
	}

The call to abort() above kills the crash session, which is both annoying and unnecessary. I am seeing it with a dumpfile from a customer who has their own dumping scheme based upon LKCD version 7. I understand that this may be a problem with their LKCD port, but nonetheless, it's the only place in the crash utility that doesn't recover gracefully from dumpfile access errors.

Anyway, I would like to either:

 1. change the error(INFO...) to error(FATAL...) so that run-time commands encountering this error will just fail, and the session will return to the crash> prompt, or
 2. return 0, so that a "seek error" can subsequently be displayed by readmem().

Number 2 is preferable, because it yields more clues as to where the readmem() call came from, but since I don't know much about the LKCD physical memory zones stuff, is there any reason that shouldn't be done?

Thanks,
  Dave
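Option 2 can be sketched in isolation as follows. This is a hypothetical, simplified model (the structure and function names are invented for illustration and are not the ones in lkcd_common.c): it keeps the first recorded offset, reports a conflict on stderr, and returns 0 instead of calling abort(). Returning 0 for a zone that is not found is an assumption, since the excerpt does not show how ret is initialized.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the LKCD zone tables. */
struct page_desc_sketch {
	unsigned long long offset;	/* 0 means "not yet recorded" */
};

struct zone_desc_sketch {
	unsigned long long start;
	struct page_desc_sketch *pages;
};

struct lkcd_sketch {
	int num_zones;
	struct zone_desc_sketch *zones;
};

/*
 * Record the dumpfile offset of a page.  Returns 1 when a new offset
 * is stored and 0 when one was already recorded.  A conflicting
 * offset is reported and also returns 0 (option 2) instead of calling
 * abort(); the first recorded offset is left in place.  Returning 0
 * for an unknown zone is an assumption of this sketch.
 */
static int save_offset_sketch(struct lkcd_sketch *lkcd,
			      unsigned long long zone,
			      unsigned long long page,
			      unsigned long long off)
{
	int ii;

	for (ii = 0; ii < lkcd->num_zones; ii++) {
		struct page_desc_sketch *pd;

		if (lkcd->zones[ii].start != zone)
			continue;

		pd = &lkcd->zones[ii].pages[page];
		if (pd->offset == 0) {
			pd->offset = off;	/* first sighting: record it */
			return 1;
		}
		if (pd->offset != off)	/* conflict: report, don't abort */
			fprintf(stderr, "conflicting page: zone %llu, "
				"page %llu: %llu != %llu\n",
				zone, page, off, pd->offset);
		return 0;
	}
	return 0;
}
```

The caller still sees a distinct "nothing new was saved" result, so a later read of that page can fail with an ordinary error rather than terminating the session.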

[PATCH] Improve error handling when architecture doesn't match

by Bernhard Walle

Currently, crash always prints

crash: vmcore: not a supported file format

if you try to open a dump file which is not supported. However, this can be misleading if you have a valid ELF core dump but are simply using crash built for the wrong architecture. In the case I observed, the user had an ELF64 x86 dump file and assumed it was x86-64. However, it was actually an i386 core dump that was ELF64 because kexec was called with --elf64-core-headers, which makes sense if the i386 machine has PAE and possibly more than 4 GiB of physical RAM.

After this patch is applied, an example output is

Looks like a valid ELF dump, but host architecture (X86_64) \
doesn't match dump architecture (IA64).

or, if I try to open a PPC64 dump on x86-64:

Looks like a valid ELF dump, but host endianess (LE) \
doesn't match target endianess (BE)

Please review and consider applying.

Signed-off-by: Bernhard Walle <bwalle(a)suse.de>

---
 defs.h    |    3 ++-
 netdump.c |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 tools.c   |    9 ++++++++-
 3 files changed, 53 insertions(+), 7 deletions(-)

--- a/defs.h
+++ b/defs.h
@@ -3198,7 +3198,8 @@ void stall(ulong);
 char *pages_to_size(ulong, char *);
 int clean_arg(void);
 int empty_list(ulong);
-int machine_type(char *);
+int machine_type(const char *);
+int is_big_endian(void);
 void command_not_supported(void);
 void option_not_supported(int);
 void please_wait(char *);
--- a/netdump.c
+++ b/netdump.c
@@ -36,6 +36,32 @@ static void check_dumpfile_size(char *);
 #define ELFREAD  0
 
 #define MIN_PAGE_SIZE (4096)
+
+
+/*
+ * Checks if the machine type of the host matches required_type.
+ * If not, it prints a short error message for the user.
+ */
+static int machine_type_error(const char *required_type)
+{
+	if (machine_type(required_type))
+		return 1;
+	else {
+		fprintf(stderr, "Looks like a valid ELF dump, but host "
+				"architecture (%s) doesn't match dump "
+				"architecture (%s).\n",
+				MACHINE_TYPE, required_type);
+		return 0;
+	}
+}
+
+/*
+ * Returns endianess in a string
+ */
+static const char *endianess_to_string(int big_endian)
+{
+	return big_endian ? "BE" : "LE";
+}
 
 /*
  * Determine whether a file is a netdump/diskdump/kdump creation,
@@ -98,6 +124,18 @@ is_netdump(char *file, ulong source_quer
 	 *  If either kdump difference is seen, presume kdump -- this
 	 *  is obviously subject to change.
 	 */
+
+	/* check endianess */
+	if ((STRNEQ(elf32->e_ident, ELFMAG) || STRNEQ(elf64->e_ident, ELFMAG)) &&
+	    (elf32->e_type == ET_CORE || elf64->e_type == ET_CORE) &&
+	    (elf32->e_ident[EI_DATA] == ELFDATA2LSB && is_big_endian()) ||
+	    (elf32->e_ident[EI_DATA] == ELFDATA2MSB && !is_big_endian()))
+		fprintf(stderr, "Looks like a valid ELF dump, but host "
+				"endianess (%s) doesn't match target "
+				"endianess (%s)\n",
+				endianess_to_string(is_big_endian()),
+				endianess_to_string(elf32->e_ident[EI_DATA] == ELFDATA2MSB));
+
 	if (STRNEQ(elf32->e_ident, ELFMAG) && 
 	    (elf32->e_ident[EI_CLASS] == ELFCLASS32) &&
 	    (elf32->e_ident[EI_DATA] == ELFDATA2LSB) &&
@@ -108,7 +146,7 @@ is_netdump(char *file, ulong source_quer
 		switch (elf32->e_machine)
 		{
 		case EM_386:
-			if (machine_type("X86"))
+			if (machine_type_error("X86"))
 				break;
 		default:
 			goto bailout;
@@ -133,28 +171,28 @@ is_netdump(char *file, ulong source_quer
 		{
 		case EM_IA_64:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
-				machine_type("IA64"))
+				machine_type_error("IA64"))
 				break;
 			else
 				goto bailout;
 
 		case EM_PPC64:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2MSB) &&
-				machine_type("PPC64"))
+				machine_type_error("PPC64"))
 				break;
 			else
 				goto bailout;
 
 		case EM_X86_64:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
-				machine_type("X86_64"))
+				machine_type_error("X86_64"))
 				break;
 			else
 				goto bailout;
 
 		case EM_386:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
-				machine_type("X86"))
+				machine_type_error("X86"))
 				break;
 			else
 				goto bailout;
--- a/tools.c
+++ b/tools.c
@@ -4518,11 +4518,18 @@ empty_list(ulong list_head_addr)
 }
 
 int
-machine_type(char *type)
+machine_type(const char *type)
 {
 	return STREQ(MACHINE_TYPE, type);
 }
 
+int
+is_big_endian(void)
+{
+	unsigned short value = 0xff;
+	return *((unsigned char *)&value) != 0xff;
+}
+
 void
 command_not_supported()
 {
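Two details of the endianness hunk in this patch are worth a closer look. In C, && binds more tightly than ||, so as posted the final (ELFDATA2MSB && !is_big_endian()) test is OR-ed against the whole preceding chain instead of being grouped with the other EI_DATA test. Also, e_type is stored in the dump's own byte order, so comparing it directly with ET_CORE on a host of the opposite endianness will not match. A minimal sketch of the check with explicit grouping, reading e_type (bytes 16-17 of both ELF32 and ELF64 headers) in the file's declared byte order; the SK_* constants and helper names are hypothetical, with values taken from the ELF specification.

```c
#include <stdint.h>
#include <string.h>

/* Values from the ELF specification (normally provided by <elf.h>). */
#define SK_EI_DATA	5
#define SK_ELFDATA2LSB	1
#define SK_ELFDATA2MSB	2
#define SK_ET_CORE	4
#define SK_ELFMAG	"\177ELF"
#define SK_SELFMAG	4

/* Same idea as is_big_endian() in the patch: look at the first byte
 * of a multi-byte integer in host memory. */
static int host_is_big_endian(void)
{
	uint16_t value = 1;
	unsigned char first;

	memcpy(&first, &value, 1);
	return first == 0;
}

/* Read e_type using the byte order the file declares in EI_DATA. */
static unsigned elf_e_type(const unsigned char *hdr)
{
	if (hdr[SK_EI_DATA] == SK_ELFDATA2MSB)
		return (unsigned)hdr[16] << 8 | hdr[17];
	return (unsigned)hdr[17] << 8 | hdr[16];
}

/* Returns 1 only for an ELF core file whose byte order differs from
 * the host's; the parentheses keep the magic/ET_CORE tests applied
 * to both byte-order cases. */
static int elf_core_endian_mismatch(const unsigned char *hdr, int host_big)
{
	int is_core = memcmp(hdr, SK_ELFMAG, SK_SELFMAG) == 0 &&
		      elf_e_type(hdr) == SK_ET_CORE;
	int mismatch = (hdr[SK_EI_DATA] == SK_ELFDATA2LSB && host_big) ||
		       (hdr[SK_EI_DATA] == SK_ELFDATA2MSB && !host_big);

	return is_core && mismatch;
}
```

For example, a header starting with \177ELF, EI_DATA set to ELFDATA2MSB and e_type bytes 00 04 is reported as a mismatch when host_big is 0 (as host_is_big_endian() returns on x86-64) but not when it is 1.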

x86 backtrace is dependent upon struct pt_regs at compile time

by Alan Tyson

This problem has been reported before, but the discussion on it seemed to move off track and I don't think that anyone really found the root cause.

The problem is that the x86 backtrace functionality in crash is dependent upon the struct pt_regs taken from <asm/ptrace.h> at compile time. struct pt_regs changed in 2.6.20. The result of this is that if crash is compiled on 2.6.20 or later and subsequently used to look at a 2.6.19 or earlier dump, then exception frames are incorrectly displayed and backtraces stop at them. Here is an example of a 2.6.22-compiled crash displaying a trace from a RHEL5 (2.6.18) dump:

crash> bt
PID: 3490   TASK: f7f5a000  CPU: 0   COMMAND: "insmod"
 #0 [f664ddd0] crash_kexec at c0441c78
 #1 [f664de14] die at c04064a4
 #2 [f664de44] do_page_fault at c0605eea
 #3 [f664de94] error_code (via page_fault) at c0405a6f
    EAX: 00000000  EBX: f8dd3400  ECX: 00200082  EDX: 00200000  DS:  007b
    ESI: f7bbeab0  ES:  007b      EDI: f7bbe800  SS:  ffffe800  ESP: 00000000
    EBP: f7bbead8  CS:  0060      EIP: f8dd300d  ERR: ffffffff  EFLAGS: 00210296
crash>

Note that in the above, crash thinks that the exception frame is a user mode one and not a kernel frame.
If crash is compiled on RHEL5 (2.6.18), then the trace looks like this:

crash> bt
PID: 3490   TASK: f7f5a000  CPU: 0   COMMAND: "insmod"
 #0 [f664ddd0] crash_kexec at c0441c78
 #1 [f664de14] die at c04064a4
 #2 [f664de44] do_page_fault at c0605eea
 #3 [f664de94] error_code (via page_fault) at c0405a6f
    EAX: 00000000  EBX: f8dd3400  ECX: 00200082  EDX: 00200000  EBP: f7bbead8
    DS:  007b      ESI: f7bbeab0  ES:  007b      EDI: f7bbe800
    CS:  0060      EIP: f8dd300d  ERR: ffffffff  EFLAGS: 00210296
 #4 [f664dec8] function2 at f8dd300d
 #5 [f664dee0] sys_init_module at c043e717
 #6 [f664dfb8] system_call at c0404ef8
    EAX: ffffffda  EBX: 0861a028  ECX: 00010144  EDX: 0861a018
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00307ff4
    SS:  007b      ESP: bfe5695c  EBP: bfe569a8
    CS:  0073      EIP: 00d37402  ERR: 00000080  EFLAGS: 00200206
crash>

A similar problem happens if crash is compiled on pre-2.6.20 and then used to analyse a 2.6.20 or later dump.

Dave, I have attached a patch to this e-mail which removes the dependence upon <asm/ptrace.h> from lkcd_x86_trace.c (which, by the way, is used for non-LKCD dumps as well as LKCD dumps). I notice that eframe_init() in x86.c initialises several variables which correspond to the struct pt_regs, so I've had to make these external for lkcd_x86_trace.c's use. I have no problem with this being reworked if you feel that these symbols really should be in defs.h (or any other rework that you think fit, for that matter).

Regards,
Alan Tyson, HP.
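The general idea of removing the compile-time dependence can be sketched as reading saved registers through offsets determined at run time instead of through a struct pt_regs compiled into crash. In this hypothetical sketch the offsets live in a small table (frame_layout, an invented name) that a real tool would fill in from the dumped kernel's debuginfo. The user/kernel test relies on the x86 rule that the low two bits of the saved CS selector hold the privilege level, which is why CS 0060 in the traces above is a kernel frame and CS 0073 a user frame.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: byte offsets of a subset of the saved registers in an
 * exception frame, supplied at run time rather than taken from a
 * compile-time struct pt_regs. */
struct frame_layout {
	size_t eip;
	size_t xcs;
	size_t esp;
	size_t xss;
};

/* Fetch a 32-bit register value from a raw copy of the frame;
 * memcpy avoids any alignment assumptions about the buffer. */
static uint32_t frame_reg(const unsigned char *frame, size_t off)
{
	uint32_t val;

	memcpy(&val, frame + off, sizeof(val));
	return val;
}

/* The low two bits of the saved CS are the privilege level:
 * 0 for a frame taken in kernel mode, 3 for user mode. */
static int frame_is_user_mode(const unsigned char *frame,
			      const struct frame_layout *lay)
{
	return (frame_reg(frame, lay->xcs) & 3) == 3;
}
```

Because only the offsets change between kernel versions, the same binary can decode frames from both pre-2.6.20 and later kernels, provided the correct layout is selected for each dump.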

Re: [Crash-utility] problems running crash on recent rawhide live kernels

by Dave Anderson

> Jeff Layton wrote:
> > Relevant packages:
> >
> > kernel-2.6.24-0.62.rc3.git5.fc9.x86_64
> > kernel-debuginfo-2.6.24-0.62.rc3.git5.fc9.x86_64
> > crash-4.0-4.10.x86_64
> >
> > ... the host is a FV xen guest (but that shouldn't matter, should
> > it?).

To get crash version 4.0-4.11 to run against that particular dumpfile, it needs to know the kernel's "phys_base" relocation value. And I don't know how (or if it's even possible) to get it from a fully-virtualized Xen guest dumpfile. However, if you run crash live on the kernel that panicked, you can determine it. So running live on kernel-2.6.24-0.62.rc3.git5.fc9, I see:

crash> help -m | grep phys_base
                 phys_base: ffffffffff200000
crash>

...which in turn can be used as a command line argument for the xendump dumpfile from that kernel. So, taking the sample dumpfile you gave me:

# crash --machdep phys_base=0xffffffffff200000 vmlinux vmcore-rawhide.xmdump

crash 4.0-4.11
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

NOTE: setting phys_base to: 0xffffffffff200000

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: vmlinux
    DUMPFILE: vmcore-rawhide.xmdump
        CPUS: 1
        DATE: Tue Dec 4 15:41:08 2007
      UPTIME: 06:10:51
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 74
    NODENAME: dhcp231-229.rdu.redhat.com
     RELEASE: 2.6.24-0.62.rc3.git5.fc9
     VERSION: #1 SMP Sat Dec 1 13:59:08 EST 2007
     MACHINE: x86_64  (3458 Mhz)
      MEMORY: 511.6 MB
       PANIC: "SysRq : Trigger a crashdump"
         PID: 0
     COMMAND: "swapper"
        TASK: ffffffff813a1780  [THREAD_INFO: ffffffff81496000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)

crash>

Pain in the ass. But I don't know any better way.

Dave
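To show why phys_base matters, here is a minimal sketch of the translation it feeds into for addresses in the x86_64 kernel-text mapping, assuming the usual paddr = vaddr - __START_KERNEL_map + phys_base form with __START_KERNEL_map at 0xffffffff80000000. The function name is hypothetical. Because the arithmetic is unsigned 64-bit, a phys_base of ffffffffff200000 wraps around and acts as a downward shift of 0xe00000.

```c
#include <stdint.h>

/* Start of the x86_64 kernel-text mapping (__START_KERNEL_map). */
#define START_KERNEL_MAP_SKETCH 0xffffffff80000000ULL

/* Translate a kernel-text virtual address to a physical address on a
 * relocatable x86_64 kernel.  uint64_t arithmetic wraps modulo 2^64,
 * so a "negative" phys_base works without special casing. */
static uint64_t kernel_text_vtop(uint64_t vaddr, uint64_t phys_base)
{
	return vaddr - START_KERNEL_MAP_SKETCH + phys_base;
}
```

With phys_base set to 0xffffffffff200000, the swapper TASK address ffffffff813a1780 from the session above would translate to physical address 0x5a1780; with a phys_base of 0 it would be 0x13a1780. Guessing phys_base wrong therefore shifts every kernel-text read, which is why crash needs the value for this dumpfile.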

crash version 4.0-4.12 is available

by Dave Anderson

- Fix for the "kmem -n" command to handle the 2.6.24 kernel replacement of the "node_online_map" nodemask with its appropriate entry in the new "node_states[]" nodemask array. Without the patch, the per-node zone data would not be displayed, and any commands depending upon the node table data would be affected.
  (anderson(a)redhat.com)

- Fix for "kmem -p" on 2.6.24 x86_64 kernels that are configured with CONFIG_SPARSEMEM_VMEMMAP, which use a virtually-mapped page struct array. Without the patch, the virtual-to-physical translation of each page structure was invalid, and "kmem -p" would display invalid data. This also affected other commands, such as the output of "kmem -i" and of a "vtop" command on a mapped page address. Also, the virtual base address of the region is now displayed by the "mach" command.
  (oomichi(a)mxs.nes.nec.co.jp, anderson(a)redhat.com)

- Fix for the "dev" command's character device name string output to recognize the change of the name structure member from a pointer to an embedded string. Without the patch, 2.6.16 and later kernels would display "(unknown)" character device names.
  (olivier.daudel(a)u-paris10.fr, anderson(a)redhat.com)

- Fix for the "kmem -[sS]" command to handle the 2.6.24 change to the CONFIG_SLUB kmem_cache structure, which reworked the manner in which the per-cpu slabs are referenced. Without the patch, the command would fail with several error messages of the type: "kmem: page_to_nid: invalid page: ffff81003993f4b0".
  (anderson(a)redhat.com)

- Fix for the "kmem -[fF]" command to handle the 2.6.24 kernel change of the free_area struct, which replaced the single linked list of pages with 5 (MIGRATE_TYPES) linked lists. Without the patch, the command would fail with the error message: "kmem: unrecognized free_area struct size: 88".
  (anderson(a)redhat.com)

- Fix for the "runq" command to handle the 2.6.24 kernel change to the CFS scheduler that introduced per-cpu init_cfs_rq structures for task group scheduling. Without the patch, no queued tasks were displayed, because the rb_root of queued tasks was being taken from the embedded cfs_rq in each per-cpu runqueue.
  (anderson(a)redhat.com)

Download from: http://people.redhat.com/anderson
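The free_area size in the "kmem -[fF]" error message can be checked with a little arithmetic. Assuming the 2.6.24 layout is an array of MIGRATE_TYPES list heads followed by an unsigned long nr_free, on x86_64 (16-byte list_head, 8-byte unsigned long) the size is 5 * 16 + 8 = 88 bytes, versus 16 + 8 = 24 bytes for the single-list layout. The structures below are simplified mirrors for illustration, and free_lists_per_area() is a hypothetical helper, not crash's code.

```c
#include <stddef.h>

/* Simplified mirror of the kernel's doubly linked list head. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

/* Before 2.6.24: a single free list per order. */
struct free_area_old {
	struct list_head_sketch free_list;
	unsigned long nr_free;
};

/* 2.6.24: one free list per migrate type. */
#define MIGRATE_TYPES_SKETCH 5
struct free_area_2_6_24 {
	struct list_head_sketch free_list[MIGRATE_TYPES_SKETCH];
	unsigned long nr_free;
};

/* Infer the number of free lists from the structure size alone,
 * assuming the layout above: list heads followed by one counter. */
static size_t free_lists_per_area(size_t free_area_size)
{
	return (free_area_size - sizeof(unsigned long)) /
	       sizeof(struct list_head_sketch);
}
```

A size-based check like this is one way a debugger can tell the two layouts apart from the dumped kernel's debuginfo without hard-coding the kernel version.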

Patch for command dev

by Olivier Daudel

Hello Dave,

A small patch for dev.c. If I am correct, with 2.6.16 the name member in chrdevs becomes an array.

crash> dev
CHRDEV    NAME              OPERATIONS
   1      (unknown)         (none)
   4      (unknown)         (none)
   4      (unknown)         (none)
   4      (unknown)         (none)
   5      (unknown)         (none)

With the patch:

crash> dev
CHRDEV    NAME              OPERATIONS
   1      mem               (none)
   4      /dev/vc/0         (none)
   4      tty               (none)
   4      ttyS              (none)
   5      /dev/tty          (none)

--- crash-4.0-4.11/dev.c	2007-12-06 16:47:06.000000000 +0100
+++ crash-4.0-4.11-change/dev.c	2007-12-10 17:13:30.000000000 +0100
@@ -202,7 +202,9 @@
 		name = ULONG(char_device_struct_buf +
 			OFFSET(char_device_struct_name));
 		if (name) {
-			if (!read_string(name, buf, BUFSIZE-1))
+			if (THIS_KERNEL_VERSION >= LINUX(2,6,16))
+				sprintf(buf,char_device_struct_buf+OFFSET(char_device_struct_name));
+			else if (!read_string(name, buf, BUFSIZE-1))
 				sprintf(buf, "(unknown)");
 		} else
 			sprintf(buf, "(unknown)");
@@ -244,7 +246,9 @@
 		name = ULONG(char_device_struct_buf +
 			OFFSET(char_device_struct_name));
 		if (name) {
-			if (!read_string(name, buf, BUFSIZE-1))
+			if (THIS_KERNEL_VERSION >= LINUX(2,6,16))
+				sprintf(buf,char_device_struct_buf+OFFSET(char_device_struct_name));
+			else if (!read_string(name, buf, BUFSIZE-1))
 				sprintf(buf, "(unknown)");
 		} else
 			sprintf(buf, "(unknown)");

----------------------------------------------------------------
This message was sent via IMP, courtesy of the Universite Paris 10 Nanterre
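The hunks in this patch pass the embedded name directly as sprintf()'s format string, so any '%' in a device name would be interpreted as a conversion specifier. A minimal sketch of a safer copy: copy_embedded_name() is a hypothetical helper, and name_len stands for the length of the embedded array, which a real tool would take from the kernel's debuginfo rather than hard-code.

```c
#include <stdio.h>

/*
 * Copy a device name embedded in a raw copy of a structure (the
 * 2.6.16+ layout, where the name is an array rather than a pointer).
 * "%.*s" prints the name literally, stops at the first NUL or after
 * name_len bytes, and snprintf never writes past buflen.
 * Returns the length the full name would have had.
 */
static int copy_embedded_name(char *buf, size_t buflen,
			      const char *struct_buf, size_t name_offset,
			      size_t name_len)
{
	return snprintf(buf, buflen, "%.*s", (int)name_len,
			struct_buf + name_offset);
}
```

The name_len bound also protects against an array that is not NUL-terminated in a corrupted dump, since the read never goes past the end of the field.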

Right way to display contents of memory[crash on ia64]

by Dheeraj Sangamkar

Hi,

I am using crash 4.0-2.30 on an ia64 machine. The memory dump of the stack shows parameters on the stack, one of which is a user space pointer.

e00000014c930ed8:  __gp             v+4643276848
e00000014c930ee8:  60000fffffffb390 00000000000000ff
e00000014c930ef8:  v+4643276864     v+5579701608
e00000014c930f08:  sys_readlink+480 0000000000000792

OR

e00000014c930ed8:  a0000001009bb820 e000000114c2c830    .......0.......
e00000014c930ee8:  60000fffffffb390 00000000000000ff   .......`........
e00000014c930ef8:  e000000114c2c840 e00000014c937d68   @.......h}.L....
e00000014c930f08:  a00000010013da60 0000000000000792   `...............

I want to find what the parameter v+4643276848/e000000114c2c830 points to. I used rd to print it, but I don't see what I expect (I used "rd e000000114c2c830 10"). What's the right way to inspect that memory?

Dheeraj
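The ASCII column of the second display can be reproduced by reading each 64-bit word lowest-addressed byte first, which is how this little-endian dump is laid out; for example, e00000014c937d68 begins with the bytes 68 7d 93 4c, shown as "h}.L". A small sketch (the function name is hypothetical) that treats bytes 0x20 through 0x7e as printable and everything else as '.':

```c
#include <stdint.h>

/* Render the 8 bytes of a 64-bit word the way a little-endian memory
 * dump's ASCII column shows them: lowest-addressed byte first, with
 * non-printable bytes replaced by '.'.  out receives 8 chars + NUL. */
static void word_to_ascii_le(uint64_t word, char out[9])
{
	for (int i = 0; i < 8; i++) {
		unsigned char c = (unsigned char)(word >> (8 * i));

		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[8] = '\0';
}
```

Applied to a0000001009bb820, the first byte is 0x20, which is why the first ASCII field of the line at e00000014c930ed8 begins with a space.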

Heads up: crash command errors with 2.6.24 kernels

by Dave Anderson

It should be noted that while version 4.0-4.11 will at least allow a crash session to initialize, several other 2.6.24-related kernel changes have broken key commands. Among them, at least on x86_64 kernels:

 1. "kmem -[sS]" fails due to changes in the CONFIG_SLUB code between 2.6.22 and 2.6.24.
 2. "kmem <address>" doesn't work at all.
 3. "kmem -n" fails to show any pgdat-node related information.
 4. "kmem -f" doesn't work at all.
 5. "kmem -i" doesn't work at all.
 6. "runq" for the CFS scheduler no longer shows any queued tasks, but only the relevant structure addresses.
 7. The kernel's use of a virtual mem_map array on x86_64 is not handled, and this may lead to other page struct related errors.

Dave
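Item 7 refers to the CONFIG_SPARSEMEM_VMEMMAP layout, in which all page structs form one virtually contiguous array, so converting between a page frame number and its page struct address is plain arithmetic. A minimal sketch with hypothetical function names; the array base and sizeof(struct page) are passed in as parameters, since a real tool would take both from the dumped kernel rather than hard-code them.

```c
#include <stdint.h>

/* pfn -> address of its page struct in the virtual mem_map array. */
static uint64_t vmemmap_pfn_to_page(uint64_t vmemmap_base,
				    uint64_t page_struct_size, uint64_t pfn)
{
	return vmemmap_base + pfn * page_struct_size;
}

/* Address of a page struct -> its pfn (inverse of the above). */
static uint64_t vmemmap_page_to_pfn(uint64_t vmemmap_base,
				    uint64_t page_struct_size, uint64_t page)
{
	return (page - vmemmap_base) / page_struct_size;
}
```

The catch for a debugger is that the array itself lives in virtually mapped memory, so reading a page struct still requires translating its virtual address through the kernel page tables rather than through the direct mapping.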

crash version 4.0-4.11 is available

by Dave Anderson

- Fix for task-gathering to handle the 2.6.24 pid_namespace-related changes to the kernel pid_hash array. Without the patch, the crash session fails during initialization with the message "crash: cannot gather a stable task list via pid_hash (500 retries)".
  (anderson(a)redhat.com)

- Fix for "kmem -f <address>" and "kmem <address>" commands on x86 kernels, which may incorrectly indicate that the address is in the kernel's free page list. Without this patch, if the address argument is a physical address over 4GB, or a page struct address referencing a physical address over 4GB, it is possible that the address would incorrectly be shown as being in the kernel's free page list.
  (anderson(a)redhat.com)

- Fix for x86 "bt" command for active tasks in Egenera dumpfiles based upon LKCD version 7. Without the patch, the starting points for the active task backtraces were erroneous.
  (anderson(a)redhat.com)

- Fix for a potential segmentation violation during crash session initialization if a task's kernel stack has been completely overrun, corrupting its thread_info structure at the bottom of the stack. This could occur running against kernels from 2.6.8 through 2.6.18. With the patch, the suspect task will be reported during the task initialization sequence.
  (anderson(a)redhat.com)

- Fix for "kmem -S" error message if a slab object is found in both a per-cpu list and on a slab's global free list. Without the patch, the object address and cpu number values are flip-flopped in the error message.
  (bob.montgomery(a)hp.com)

Download from: http://people.redhat.com/anderson
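One common way to detect the overrun described in the fourth item is to check that the thread_info at the bottom of the kernel stack still points back at its task. The sketch below assumes a 64-bit target and that thread_info begins with its task pointer, as on x86 and x86_64 kernels of that era; the structure and function names are hypothetical, and this is not necessarily the check crash performs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical minimal view of the start of a 2.6-era thread_info,
 * which sits at the lowest address of the task's kernel stack. */
struct thread_info_head {
	uint64_t task;		/* back-pointer to the owning task_struct */
};

/* Returns 1 if the thread_info copied from the bottom of a kernel
 * stack still points back at the expected task, 0 if it looks
 * overwritten (for example by a stack overrun). */
static int thread_info_looks_sane(const unsigned char *stack_bottom,
				  uint64_t expected_task)
{
	struct thread_info_head ti;

	memcpy(&ti, stack_bottom, sizeof(ti));
	return ti.task == expected_task;
}
```

A task that fails such a check can be reported and skipped during initialization instead of having its corrupted fields dereferenced.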

typo affects kmem -S error output

by Bob Montgomery

Dave,

This patch fixes a typo in memory.c.

Before:
=======
crash> kmem -S sctp_bind_bucket
...
kmem: "sctp_bind_bucket" cache: object 0 on both free and cpu 651223584 lists
...

(Note cpu number)

After:
======
crash> kmem -S sctp_bind_bucket
...
kmem: "sctp_bind_bucket" cache: object ffff810126d0e220 on both free and cpu 0 lists
...

Bob Montgomery
Working at HP in Fort Collins
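The "Before" output is consistent with swapped arguments: 651223584 is 0x26d0e220, the low 32 bits of the object address ffff810126d0e220, printed through %d, while the cpu number 0 landed in the object slot. A hypothetical helper that builds the corrected message, with the arguments in the same order as the conversions in the format string:

```c
#include <stdio.h>

/* Build the duplicate-object message with each argument matching its
 * conversion: cache name (%s), object address (%llx), cpu (%d).
 * Returns the length of the full message, as snprintf does. */
static int format_dup_object_msg(char *buf, size_t len, const char *cache,
				 unsigned long long obj, int cpu)
{
	return snprintf(buf, len,
			"\"%s\" cache: object %llx on both free and cpu %d lists",
			cache, obj, cpu);
}
```

Compilers that check printf-style formats (for example with GCC's -Wformat) can flag this kind of mismatch when the argument types differ, as they do here between an address and an int.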
