Question for LKCD maintainers
by Dave Anderson

Long after I stopped tinkering with the LKCD code in crash,
changes were contributed to support physical memory zones
in the LKCD dumpfile format.  Specifically there is this
piece of save_offset() in lkcd_common.c:
         /* find the zone */
         for (ii=0; ii < lkcd->num_zones; ii++) {
                 if (lkcd->zones[ii].start == zone) {
                         if (lkcd->zones[ii].pages[page].offset != 0) {
                            if (lkcd->zones[ii].pages[page].offset != off) {
                                 error(INFO, "conflicting page: zone %lld, "
                                         "page %lld: %lld, %lld != %lld\n",
                                         (unsigned long long)zone,
                                         (unsigned long long)page,
                                         (unsigned long long)paddr,
                                         (unsigned long long)off,
                                         (unsigned long long) \
                                             lkcd->zones[ii].pages[page].offset);
                                 abort();
                            }
                            ret = 0;
                         } else {
                            lkcd->zones[ii].pages[page].offset = off;
                            ret = 1;
                         }
                         break;
                 }
         }
The call to abort() above kills the crash session, which is both
annoying and unnecessary.
I am seeing it with a dumpfile from a customer who has their own dumping
scheme based upon LKCD version 7.  I understand that this may be a
problem with their LKCD port, but nonetheless, it's the only place in
the crash utility that doesn't recover gracefully from dumpfile access
errors.
Anyway, I would like to either:
  1. change the error(INFO...) to error(FATAL...) so that run-time
     commands encountering this error will just fail, and the session
     will return to the crash> prompt, or
  2. return 0, so that a "seek error" can be subsequently displayed
     by the readmem() command.
Number 2 is preferable, because it yields more clues as to where the
readmem() came from, but since I don't know much about the LKCD
physical memory zones stuff, is there any reason that shouldn't
be done?
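
For reference, option 2 could look roughly like the following.  This is
only a sketch against simplified, hypothetical stand-ins for the LKCD zone
tables (struct page_entry, struct zone_entry and save_offset_sketch() are
made-up names, not the lkcd_common.c code), and the return value for a
zone that is not found is an assumption, since the excerpt above does not
show that case:

```c
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the LKCD zone tables. */
struct page_entry {
	unsigned long long offset;	/* 0 means "not recorded yet" */
};

struct zone_entry {
	unsigned long long start;
	struct page_entry *pages;
};

/*
 * Record the dumpfile offset of a page.
 * Returns 1 if the offset was newly stored.  Returns 0 if it was already
 * known, or if it conflicts with the stored offset; in the conflict case
 * a message is printed and the caller is left to report the failure,
 * instead of abort() ending the whole session.
 */
int save_offset_sketch(struct zone_entry *zones, int num_zones,
		       unsigned long long zone, unsigned long long page,
		       unsigned long long off)
{
	int ii;

	for (ii = 0; ii < num_zones; ii++) {
		if (zones[ii].start != zone)
			continue;
		if (zones[ii].pages[page].offset == 0) {
			zones[ii].pages[page].offset = off;
			return 1;
		}
		if (zones[ii].pages[page].offset != off)
			fprintf(stderr, "conflicting page: zone %llu, "
				"page %llu: %llu != %llu\n", zone, page, off,
				zones[ii].pages[page].offset);
		return 0;
	}
	return 0;	/* zone not found: assumed to mean "nothing stored" */
}
```

With this shape the first recorded offset for a page is kept, and a later
conflicting offset is reported but otherwise ignored.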
Thanks,
   Dave
                                
17 years, 10 months

[PATCH] Improve error handling when architecture doesn't match
by Bernhard Walle

Currently, crash always prints
        crash: vmcore: not a supported file format
if you try to open a dump file that is not supported. However, this can be
misleading if you have a valid ELF core dump but are using a crash binary built
for the wrong architecture. In the case I observed, the user had an ELF64 x86
dump file and assumed it was x86-64. However, it was actually an i386 core dump
that was ELF64 because kexec was called with --elf64-core-headers, which makes
sense if the i386 machine has PAE and possibly more than 4 GiB of physical RAM.
After the patch is applied, an example output is
        Looks like a valid ELF dump, but host architecture (X86_64) \
        doesn't match dump architecture (IA64).
or if I try to open a PPC64 dump on x86-64:
        Looks like a valid ELF dump, but host endianess (LE) \
        doesn't match target endianess (BE)
Please review and consider applying.
Signed-off-by: Bernhard Walle <bwalle(a)suse.de>
---
 defs.h    |    3 ++-
 netdump.c |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 tools.c   |    9 ++++++++-
 3 files changed, 53 insertions(+), 7 deletions(-)
--- a/defs.h
+++ b/defs.h
@@ -3198,7 +3198,8 @@ void stall(ulong);
 char *pages_to_size(ulong, char *);
 int clean_arg(void);
 int empty_list(ulong);
-int machine_type(char *);
+int machine_type(const char *);
+int is_big_endian(void);
 void command_not_supported(void);
 void option_not_supported(int);
 void please_wait(char *);
--- a/netdump.c
+++ b/netdump.c
@@ -36,6 +36,32 @@ static void check_dumpfile_size(char *);
 #define ELFREAD  0
 
 #define MIN_PAGE_SIZE (4096)
+
+
+/*
+ * Checks if the machine type of the host matches required_type.
+ * If not, it prints a short error message for the user.
+ */
+static int machine_type_error(const char *required_type)
+{
+	if (machine_type(required_type))
+		return 1;
+	else {
+		fprintf(stderr, "Looks like a valid ELF dump, but host "
+				"architecture (%s) doesn't match dump "
+				"architecture (%s).\n",
+				MACHINE_TYPE, required_type);
+		return 0;
+	}
+}
+
+/*
+ * Returns endianess in a string
+ */
+static const char *endianess_to_string(int big_endian)
+{
+	return big_endian ? "BE" : "LE";
+}
 	
 /*
  *  Determine whether a file is a netdump/diskdump/kdump creation, 
@@ -98,6 +124,18 @@ is_netdump(char *file, ulong source_quer
 	 *  If either kdump difference is seen, presume kdump -- this
 	 *  is obviously subject to change.
 	 */
+
+	/* check endianess */
+	if ((STRNEQ(elf32->e_ident, ELFMAG) || STRNEQ(elf64->e_ident, ELFMAG)) &&
+			(elf32->e_type == ET_CORE || elf64->e_type == ET_CORE) &&
+			(elf32->e_ident[EI_DATA] == ELFDATA2LSB && is_big_endian()) ||
+			(elf32->e_ident[EI_DATA] == ELFDATA2MSB && !is_big_endian()))
+		fprintf(stderr, "Looks like a valid ELF dump, but host "
+				"endianess (%s) doesn't match target "
+				"endianess (%s)\n",
+			endianess_to_string(is_big_endian()),
+			endianess_to_string(elf32->e_ident[EI_DATA] == ELFDATA2MSB));
+
         if (STRNEQ(elf32->e_ident, ELFMAG) && 
 	    (elf32->e_ident[EI_CLASS] == ELFCLASS32) &&
   	    (elf32->e_ident[EI_DATA] == ELFDATA2LSB) &&
@@ -108,7 +146,7 @@ is_netdump(char *file, ulong source_quer
 		switch (elf32->e_machine)
 		{
 		case EM_386:
-			if (machine_type("X86"))
+			if (machine_type_error("X86"))
 				break;
 		default:
                 	goto bailout;
@@ -133,28 +171,28 @@ is_netdump(char *file, ulong source_quer
 		{
 		case EM_IA_64:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
-				machine_type("IA64"))
+				machine_type_error("IA64"))
 				break;
 			else
 				goto bailout;
 
 		case EM_PPC64:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2MSB) &&
-				machine_type("PPC64"))
+				machine_type_error("PPC64"))
 				break;
 			else
 				goto bailout;
 
 		case EM_X86_64:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
-				machine_type("X86_64"))
+				machine_type_error("X86_64"))
 				break;
 			else
 				goto bailout;
 
 		case EM_386:
 			if ((elf64->e_ident[EI_DATA] == ELFDATA2LSB) &&
-				machine_type("X86"))
+				machine_type_error("X86"))
 				break;
 			else
 				goto bailout;
--- a/tools.c
+++ b/tools.c
@@ -4518,11 +4518,18 @@ empty_list(ulong list_head_addr)
 }
 
 int
-machine_type(char *type)
+machine_type(const char *type)
 {
 	return STREQ(MACHINE_TYPE, type);
 }
 
+int
+is_big_endian(void)
+{
+	unsigned short value = 0xff;
+	return *((unsigned char *)&value) != 0xff;
+}
+
 void
 command_not_supported()
 {
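
The byte-order comparison that the patch performs can also be tried
outside of crash.  The sketch below is not part of the patch: the SK_*
constants repeat the standard ELF values so that <elf.h> is not needed,
and dump_endianness_matches_host() is a hypothetical helper name.  It
relies only on e_ident, which is a byte array and can therefore be read
before the byte order is known; multi-byte header fields such as e_type
appear byte-swapped in a foreign-endian dump.

```c
/* ELF e_ident index and data-encoding values (same values as in <elf.h>). */
enum { SK_EI_DATA = 5, SK_ELFDATA2LSB = 1, SK_ELFDATA2MSB = 2 };

/*
 * Same idea as is_big_endian() in the patch: inspect the first byte
 * in memory of a known multi-byte value.
 */
int host_is_big_endian(void)
{
	unsigned short value = 0x00ff;

	return *(unsigned char *)&value != 0xff;
}

/*
 * Returns 1 if the dump's byte order matches the host, 0 if it does not,
 * and -1 if e_ident[EI_DATA] holds neither known encoding.
 */
int dump_endianness_matches_host(const unsigned char *e_ident)
{
	switch (e_ident[SK_EI_DATA]) {
	case SK_ELFDATA2LSB:
		return !host_is_big_endian();
	case SK_ELFDATA2MSB:
		return host_is_big_endian();
	default:
		return -1;
	}
}
```

A caller would pass the first 16 bytes of the dumpfile and print the
mismatch message only when the result is 0.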
                                
17 years, 10 months

x86 backtrace is dependent upon struct pt_regs at compile time
by Alan Tyson

This problem has been reported before, but the discussion on it seemed
to move off track and I don't think that anyone really found the root cause.
The problem is that the x86 backtrace functionality in crash is
dependent upon the struct pt_regs taken from <asm/ptrace.h> at compile
time.  struct pt_regs changed in 2.6.20.  The result of this is that if
crash is compiled on 2.6.20 or later and subsequently used to look at a
2.6.19 or earlier dump, then exception frames are incorrectly displayed
and backtraces stop at them.
Here is an example of a 2.6.22-compiled crash displaying a trace from a
RHEL5 (2.6.18) dump:
crash> bt
PID: 3490   TASK: f7f5a000  CPU: 0   COMMAND: "insmod"
 #0 [f664ddd0] crash_kexec at c0441c78
 #1 [f664de14] die at c04064a4
 #2 [f664de44] do_page_fault at c0605eea
 #3 [f664de94] error_code (via page_fault) at c0405a6f
    EAX: 00000000  EBX: f8dd3400  ECX: 00200082  EDX: 00200000
    DS:  007b      ESI: f7bbeab0  ES:  007b      EDI: f7bbe800
    SS:  ffffe800      ESP: 00000000  EBP: f7bbead8
    CS:  0060      EIP: f8dd300d  ERR: ffffffff  EFLAGS: 00210296
crash>
Note that in the above, crash thinks that the exception frame is a user
mode one and not a kernel frame.
If crash was compiled on RHEL5 (2.6.18), then the trace looks like this:
crash> bt
PID: 3490   TASK: f7f5a000  CPU: 0   COMMAND: "insmod"
 #0 [f664ddd0] crash_kexec at c0441c78
 #1 [f664de14] die at c04064a4
 #2 [f664de44] do_page_fault at c0605eea
 #3 [f664de94] error_code (via page_fault) at c0405a6f
    EAX: 00000000  EBX: f8dd3400  ECX: 00200082  EDX: 00200000  EBP: f7bbead8
    DS:  007b      ESI: f7bbeab0  ES:  007b      EDI: f7bbe800
    CS:  0060      EIP: f8dd300d  ERR: ffffffff  EFLAGS: 00210296
 #4 [f664dec8] function2 at f8dd300d
 #5 [f664dee0] sys_init_module at c043e717
 #6 [f664dfb8] system_call at c0404ef8
    EAX: ffffffda  EBX: 0861a028  ECX: 00010144  EDX: 0861a018
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00307ff4
    SS:  007b      ESP: bfe5695c  EBP: bfe569a8
    CS:  0073      EIP: 00d37402  ERR: 00000080  EFLAGS: 00200206
crash>
A similar problem happens if crash is compiled on pre-2.6.20 and then
used to analyse a 2.6.20 or later dump.
Dave, I have attached a patch to this e-mail which removes the
dependence upon <asm/ptrace.h> from lkcd_x86_trace.c (which is used for
non-LKCD dumps as well as LKCD dumps, by the way).  I notice that
eframe_init() in x86.c initialises several variables which correspond to
the struct pt_regs so I've had to make these external for
lkcd_x86_trace.c's use.  I have no problem in this being reworked if you
feel that these symbols really should be in defs.h (or any other rework
that you think is fit, for that matter).
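
To illustrate the general idea of reading exception-frame registers
through offsets chosen at run time rather than through a struct compiled
into crash, here is a minimal sketch.  It is not the attached patch:
struct eframe_offsets and eframe_reg() are hypothetical names, and any
offset values a caller puts in the table would have to come from the
kernel being analysed.

```c
#include <string.h>

/*
 * Hypothetical table of register offsets within an exception frame.
 * It is filled in once at run time for the kernel being analysed,
 * so nothing depends on the build host's struct pt_regs.
 */
struct eframe_offsets {
	size_t eip;
	size_t esp;
};

/* Fetch a 32-bit register from a raw copy of the frame at a given offset. */
unsigned int eframe_reg(const unsigned char *frame, size_t offset)
{
	unsigned int value;

	memcpy(&value, frame + offset, sizeof(value));
	return value;
}
```

The same frame bytes then decode correctly for either layout, as long as
the table matches the dump's kernel.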
Regards,
Alan Tyson, HP.
                                
17 years, 10 months

Re: [Crash-utility] problems running crash on recent rawhide live kernels
by Dave Anderson

 > Jeff Layton wrote:
 > > Relevant packages:
 > >
 > >     kernel-2.6.24-0.62.rc3.git5.fc9.x86_64
 > >     kernel-debuginfo-2.6.24-0.62.rc3.git5.fc9.x86_64
 > >     crash-4.0-4.10.x86_64
 > >
 > > ... the host is a FV xen guest (but that shouldn't matter, should
 > > it?).
To get crash version 4.0-4.11 to run against that particular
dumpfile, it needs to know the kernel's "phys_base" relocation
value.  And I don't know how (or if it's even possible) to get
it from a fully-virtualized Xen guest dumpfile.  However, if
you run crash live on the kernel that panicked, you can
determine it.  So running live on kernel-2.6.24-0.62.rc3.git5.fc9
I see:
   crash> help -m | grep phys_base
                   phys_base: ffffffffff200000
   crash>
...which in turn can be used as a command line argument for the
xendump dumpfile from that kernel.  So taking the sample dumpfile
you gave me:
# crash --machdep phys_base=0xffffffffff200000 vmlinux vmcore-rawhide.xmdump
crash 4.0-4.11
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
NOTE: setting phys_base to: 0xffffffffff200000
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
       KERNEL: vmlinux
     DUMPFILE: vmcore-rawhide.xmdump
         CPUS: 1
         DATE: Tue Dec  4 15:41:08 2007
       UPTIME: 06:10:51
LOAD AVERAGE: 0.00, 0.00, 0.00
        TASKS: 74
     NODENAME: dhcp231-229.rdu.redhat.com
      RELEASE: 2.6.24-0.62.rc3.git5.fc9
      VERSION: #1 SMP Sat Dec 1 13:59:08 EST 2007
      MACHINE: x86_64  (3458 Mhz)
       MEMORY: 511.6 MB
        PANIC: "SysRq : Trigger a crashdump"
          PID: 0
      COMMAND: "swapper"
         TASK: ffffffff813a1780  [THREAD_INFO: ffffffff81496000]
          CPU: 0
        STATE: TASK_RUNNING (ACTIVE)
crash>
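
As a rough illustration of why phys_base matters, kernel-text virtual
addresses on x86_64 are translated along these lines.  This is a sketch,
not the crash source: ktext_vtop() is a hypothetical name, and the
mapping base of 0xffffffff80000000 is an assumption that is not stated
anywhere in this thread.

```c
/* Base of the x86_64 kernel-text mapping (assumed value, see above). */
#define SK_START_KERNEL_MAP 0xffffffff80000000ULL

/*
 * Translate a kernel-text virtual address to a physical address.
 * Unsigned arithmetic wraps modulo 2^64, so a "negative" phys_base such
 * as 0xffffffffff200000 needs no special handling.
 */
unsigned long long ktext_vtop(unsigned long long vaddr,
			      unsigned long long phys_base)
{
	return vaddr - SK_START_KERNEL_MAP + phys_base;
}
```

Under that assumption, the phys_base above maps virtual address
0xffffffff81000000 to physical address 0x200000; with the wrong
phys_base every kernel-text read would land on the wrong page.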
Pain in the ass.  But I don't know any better way.
Dave
                                
                                17 years, 10 months
                        
                        
                 
         
 
        
            
        
        
        
                
                        
                                
                                
                                        
                                
                         
                        
                                
                                
                                        
                                                
                                        
                                        
                                        crash version 4.0-4.12 is available
                                
                                
                                
                                    
                                        by Dave Anderson
                                    
                                
                                
                                        
- Fix for the "kmem -n" command to handle the 2.6.24 kernel replacement
   of the "node_online_map" nodemask with its appropriate entry in the
   new "node_states[]" nodemask array.  Without the patch, the per-node
   zone data would not be displayed, and any commands depending upon
   the node table data would be affected.  (anderson(a)redhat.com)
- Fix for "kmem -p" on 2.6.24 x86_64 kernels that are configured with
   CONFIG_SPARSEMEM_VMEMMAP, which use a virtually-mapped page struct
   array.  Without the patch, the virtual-to-physical translation of
   each page structure was invalid, and "kmem -p" would display invalid
   data.  This would also affect other commands, such as the
   output of "kmem -i", and the output of a "vtop" command on a mapped
   page address.  Also, the virtual base address of the region is now
   displayed by the "mach" command.
   (oomichi(a)mxs.nes.nec.co.jp, anderson(a)redhat.com)
- Fix for the "dev" command's character device name string output to
   recognize the change of the name structure member from a pointer
   to an embedded string.  Without the patch, 2.6.16 and later kernels
   would display "(unknown)" character device names.
   (olivier.daudel(a)u-paris10.fr, anderson(a)redhat.com)
- Fix for the "kmem -[sS]" command to handle the 2.6.24 change to
   the CONFIG_SLUB kmem_cache structure, which re-worked the manner
   in which the per-cpu slabs get referenced.  Without the patch,
   the command would fail with several error messages of the type:
   "kmem: page_to_nid: invalid page: ffff81003993f4b0".
   (anderson(a)redhat.com)
- Fix for the "kmem -[fF]" command to handle the 2.6.24 kernel change
   of the free_area struct, which replaced the singular linked list
   of pages with 5 (MIGRATE_TYPES) linked lists.  Without the patch,
   the command would fail with the error message: "kmem: unrecognized
   free_area struct size: 88".   (anderson(a)redhat.com)
- Fix for the "runq" command to handle the 2.6.24 kernel change to
   the CFS scheduler that introduced per-cpu init_cfs_rq structures
   for task group scheduling.  Without the patch, no queued tasks
   were displayed, because the rb_root of queued tasks was being
   taken from the embedded cfs_rq in each per-cpu runqueue.
   (anderson(a)redhat.com)
   Download from: http://people.redhat.com/anderson
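
The "free_area struct size: 88" in the "kmem -[fF]" entry above can be
reproduced with a little arithmetic: five list heads of two 8-byte
pointers each, plus one 8-byte counter, on a 64-bit kernel.  The sketch
below models that with hypothetical struct names; the single nr_free
counter is an assumption, and the sizes are those of the machine the
sketch is compiled on, whereas crash takes them from the kernel's
debuginfo.

```c
#include <stddef.h>

#define SK_MIGRATE_TYPES 5

struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

/* Layout before the change: a single list of free pages. */
struct free_area_old_sketch {
	struct list_head_sketch free_list;
	unsigned long nr_free;
};

/* Layout after the change: one list per migrate type. */
struct free_area_new_sketch {
	struct list_head_sketch free_list[SK_MIGRATE_TYPES];
	unsigned long nr_free;
};

/*
 * Returns the number of free lists implied by a struct size,
 * or -1 if the size matches neither layout.
 */
int free_lists_from_size(size_t size)
{
	if (size == sizeof(struct free_area_old_sketch))
		return 1;
	if (size == sizeof(struct free_area_new_sketch))
		return SK_MIGRATE_TYPES;
	return -1;
}
```

On a 64-bit Linux build the new layout comes to 5 * 16 + 8 = 88 bytes,
which is the size the unpatched command did not recognize.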
                                
17 years, 11 months

Patch for command dev
by Olivier Daudel

Hello Dave,
A small patch for dev.c.
If I am correct, as of 2.6.16 the name member in chrdevs became an
embedded array rather than a pointer.
crash> dev
CHRDEV    NAME            OPERATIONS
   1      (unknown)               (none)
   4      (unknown)               (none)
   4      (unknown)               (none)
   4      (unknown)               (none)
   5      (unknown)               (none)
With the patch :
crash> dev
CHRDEV    NAME            OPERATIONS
   1      mem                     (none)
   4      /dev/vc/0               (none)
   4      tty                     (none)
   4      ttyS                    (none)
   5      /dev/tty                (none)
--- crash-4.0-4.11/dev.c        2007-12-06 16:47:06.000000000 +0100
+++ crash-4.0-4.11-change/dev.c 2007-12-10 17:13:30.000000000 +0100
@@ -202,7 +202,9 @@
                name = ULONG(char_device_struct_buf +
                        OFFSET(char_device_struct_name));
                 if (name) {
-                       if (!read_string(name, buf, BUFSIZE-1))
+                       if (THIS_KERNEL_VERSION >= LINUX(2,6,16))
+                               sprintf(buf,char_device_struct_buf+OFFSET(char_device_struct_name));
+                       else if (!read_string(name, buf, BUFSIZE-1))
                                  sprintf(buf, "(unknown)");
                 } else
                         sprintf(buf, "(unknown)");
@@ -244,7 +246,9 @@
                        name = ULONG(char_device_struct_buf +
                                OFFSET(char_device_struct_name));
                        if (name) {
-                               if (!read_string(name, buf, BUFSIZE-1))
+                               if (THIS_KERNEL_VERSION >= LINUX(2,6,16))
+                                       sprintf(buf,char_device_struct_buf+OFFSET(char_device_struct_name));
+                               else if (!read_string(name, buf, BUFSIZE-1))
                                         sprintf(buf, "(unknown)");
                        } else
                                sprintf(buf, "(unknown)");
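
For readers who want to try the embedded-name case in isolation, a
bounded copy might look like the sketch below.  It is not the patch
itself: struct chrdev_array_sketch, its 64-byte name field and
copy_embedded_name() are hypothetical.  The one deliberate difference
from the patch is that the name is passed as an argument to "%s" rather
than as the format string, so a '%' in a device name is copied literally.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record layout with the name embedded in the structure. */
struct chrdev_array_sketch {
	unsigned int major;
	char name[64];
};

/*
 * Copy an embedded, NUL-terminated name out of a raw structure buffer.
 * snprintf() bounds the copy to dst_size bytes, including the NUL.
 */
void copy_embedded_name(char *dst, size_t dst_size,
			const unsigned char *struct_buf, size_t name_offset)
{
	snprintf(dst, dst_size, "%s",
		 (const char *)(struct_buf + name_offset));
}
```

A caller would pass the buffer holding the structure and the offset of
its name member, for example offsetof(struct chrdev_array_sketch, name).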
----------------------------------------------------------------
This message was sent using IMP, courtesy of Universite Paris 10 Nanterre
                                
17 years, 11 months

Right way to display contents of memory [crash on ia64]
by Dheeraj Sangamkar

Hi,
I am using crash 4.0-2.30 on an ia64 machine.
The memory dump of the stack shows parameters on the stack, one of which is
a user space pointer.
e00000014c930ed8:  __gp             v+4643276848
e00000014c930ee8:  60000fffffffb390 00000000000000ff
e00000014c930ef8:  v+4643276864     v+5579701608
e00000014c930f08:  sys_readlink+480 0000000000000792
OR
e00000014c930ed8:  a0000001009bb820 e000000114c2c830    .......0.......
e00000014c930ee8:  60000fffffffb390 00000000000000ff   .......`........
e00000014c930ef8:  e000000114c2c840 e00000014c937d68   @.......h}.L....
e00000014c930f08:  a00000010013da60 0000000000000792   `...............
I want to find what the parameter v+4643276848/e000000114c2c830 points to.
I used rd to print this ("rd e000000114c2c830 10"), but I don't see what I
expect.
What's the right way to inspect that memory?
Dheeraj
                                
17 years, 11 months

Heads up: crash command errors with 2.6.24 kernels
by Dave Anderson

It should be noted that while version 4.0-4.11 will at least allow
a crash session to initialize, there are other 2.6.24-related
kernel changes that have broken several key commands.  Among them, at
least on x86_64 kernels:
  1. "kmem -[sS]" fails due to changes in the CONFIG_SLUB code between
     2.6.22 and 2.6.24.
  2. "kmem <address>" doesn't work at all.
  3. "kmem -n" fails to show any pgdat-node related information.
  4. "kmem -f" doesn't work at all.
  5. "kmem -i" doesn't work at all.
  6. "runq" for the CFS scheduler no longer shows any queued tasks,
     but only the relevant structure addresses.
  7. The kernel's use of a virtual mem_map array on x86_64 is not
     handled, and this may lead to other page struct related errors.
Dave
                                
17 years, 11 months

crash version 4.0-4.11 is available
by Dave Anderson

- Fix for task-gathering to handle the 2.6.24 pid_namespace-related
   changes to the kernel pid_hash array.  Without the patch, the crash
   session fails during initialization with the message "crash:  cannot
   gather a stable task list via pid_hash (500 retries)".
   (anderson(a)redhat.com)
- Fix for "kmem -f <address>" and "kmem <address>" commands on
   x86 kernels, which may incorrectly indicate that the address is in
   the kernel's free page list.  Without this patch, if the address
   argument is a physical address over 4GB, or a page struct address
   referencing a physical address over 4GB, it is possible that the
   address would incorrectly be shown as being in the kernel's free
   page list.  (anderson(a)redhat.com)
- Fix for x86 "bt" command for active tasks in Egenera dumpfiles
   based upon LKCD version 7.  Without the patch, the starting points
   for the active task backtraces were erroneous.
   (anderson(a)redhat.com)
- Fix for a potential segmentation violation during crash session
   initialization if a task's kernel stack has been completely overrun,
   corrupting its thread_info structure at the bottom of the stack.
   This could occur running against kernels from 2.6.8 through 2.6.18.
   With the patch, the suspect task will be reported during the task
   initialization sequence.  (anderson(a)redhat.com)
- Fix for "kmem -S" error message if a slab object is found in both
   a per-cpu list and on a slab's global free list.  Without the patch,
   the object address and cpu number values are flip-flopped in the
   error message.  (bob.montgomery(a)hp.com)
Download from: http://people.redhat.com/anderson
                                
17 years, 11 months

typo affects kmem -S error output
by Bob Montgomery

Dave,
This patch fixes a typo in memory.c.
Before:
=======
crash> kmem -S sctp_bind_bucket
...
kmem: "sctp_bind_bucket" cache: object 0 on both free and cpu 651223584 lists
...
(Note cpu number)
After:
====== 
crash> kmem -S sctp_bind_bucket
...
kmem: "sctp_bind_bucket" cache: object ffff810126d0e220 on both free and cpu 0 lists
...
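
The "After" message corresponds to passing the arguments in the same
order as the conversions in the format string.  The memory.c source is
not shown in this thread, so the following is only a sketch with a
hypothetical helper, format_dup_object_msg(), that produces the same text:

```c
#include <stdio.h>

/*
 * Build the duplicate-object message.  The argument order matches the
 * order of the conversions: cache name, object address, cpu number.
 * Returns the value of snprintf().
 */
int format_dup_object_msg(char *dst, size_t dst_size, const char *cache,
			  unsigned long long object, int cpu)
{
	return snprintf(dst, dst_size,
		"kmem: \"%s\" cache: object %llx on both free and cpu %d lists",
		cache, object, cpu);
}
```

Swapping the last two arguments is what yields an object of 0 and a cpu
number in the hundreds of millions, as in the "Before" output.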
Bob Montgomery
Working at HP in Fort Collins
                                
                                17 years, 11 months