mount cmd crashes crash
by Bob Montgomery
I'm working on a dump of a system that did not have a PID 1. I don't
think it's relevant to the crash itself, but it does cause crash to get
a seg fault.
crash> ps | head
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff805144c0 RU 0.0 0 0 [swapper]
0 -1 1 ffff81012bc0a100 RU 0.0 0 0 [swapper]
2 -1 0 ffff81012bd3c040 IN 0.0 0 0 [migration/0]
3 -1 0 ffff81012bd3e7c0 RU 0.0 0 0 [ksoftirqd/0]
4 -1 0 ffff81012bd3e080 IN 0.0 0 0 [watchdog/0]
5 -1 1 ffff81012bd3f800 IN 0.0 0 0 [migration/1]
6 -1 1 ffff81012bd3f0c0 RU 0.0 0 0 [ksoftirqd/1]
7 -1 1 ffff81012bc0a840 IN 0.0 0 0 [watchdog/1]
8 -1 0 ffff81012af02880 IN 0.0 0 0 [events/0]
crash> mount
Segmentation fault (core dumped)
In cmd_mount() (filesys.c, line 1157), this call returns NULL and the
subsequent use causes the seg fault:
	namespace_context = pid_to_context(1);
I don't know if it was important to have the context of pid 1 for
reporting mounts, or just any context, but this hack makes the problem
go away, although it is not a very efficient way to find the lowest
existing PID above 0.
--- filesys.c.orig	2010-08-18 14:03:26.000000000 -0600
+++ filesys.c	2010-08-18 14:10:02.000000000 -0600
@@ -1153,8 +1153,12 @@ cmd_mount(void)
 	ulong vfsmount = 0;
 	int flags = 0;
 	int save_next;
+	ulong pid;
 
-	namespace_context = pid_to_context(1);
+	/* find a context */
+	pid = 1;
+	while ((namespace_context = pid_to_context(pid)) == NULL)
+		pid++;
 
 	while ((c = getopt(argcnt, args, "ifn:")) != EOF) {
 		switch(c)
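One caveat with the loop in the hack: if no task with a PID above 0 exists, it never terminates on its own. A bounded variant might look like the sketch below. This is not crash code. The names task_ctx, lookup_fn, first_context and demo_lookup are all hypothetical stand-ins (lookup_fn plays the role of pid_to_context()), and the toy task table simply mirrors the ps output above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for crash's task context; only the pid matters in this sketch. */
struct task_ctx {
    unsigned long pid;
};

/* Hypothetical lookup callback playing the role of pid_to_context(). */
typedef struct task_ctx *(*lookup_fn)(unsigned long pid);

/*
 * Return the context of the lowest existing PID in [1, max_pid], or NULL
 * if there is none, so the caller can report an error instead of
 * dereferencing a null pointer or searching forever.
 */
struct task_ctx *first_context(lookup_fn lookup, unsigned long max_pid)
{
    assert(lookup != NULL);
    /* "pid != 0" stops the search if pid ever wraps around. */
    for (unsigned long pid = 1; pid != 0 && pid <= max_pid; pid++) {
        struct task_ctx *tc = lookup(pid);
        if (tc != NULL)
            return tc;
    }
    return NULL;
}

/* Toy task table: PIDs 2..8 exist, PID 1 does not (as in the dump above). */
static struct task_ctx demo_tasks[] = { {2}, {3}, {4}, {5}, {6}, {7}, {8} };

struct task_ctx *demo_lookup(unsigned long pid)
{
    for (size_t i = 0; i < sizeof demo_tasks / sizeof demo_tasks[0]; i++)
        if (demo_tasks[i].pid == pid)
            return &demo_tasks[i];
    return NULL;
}
```

With the toy table, first_context(demo_lookup, 32768) returns the entry for PID 2, and first_context(demo_lookup, 1) returns NULL. The upper bound of 32768 is only an example value.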
Bob Montgomery
At HP
14 years, 3 months
Re: [Crash-utility] crash: invalid structure member offset
by Dave Anderson
----- "Reinoud Koornstra" <koornstra(a)hp.com> wrote:
> > -----Original Message-----
> > From: crash-utility-bounces(a)redhat.com [mailto:crash-utility-
> > bounces(a)redhat.com] On Behalf Of Dave Anderson
> > Sent: Thursday, August 12, 2010 12:18 PM
> > To: Discussion list for crash utility usage, maintenance and
> > development
> > Subject: Re: [Crash-utility] crash: invalid structure member offset
> >
> >
> > ----- "Reinoud Koornstra" <koornstra(a)hp.com> wrote:
> >
> > > Thanks,
> > >
> > > Using crash 5.0.6 worked nicely.
> > > However, I can't really look at a lot because of a bad EIP code.
> > >
> > > [ 726.601381] 802.1Q VLAN Support v1.8 Ben Greear <greearb(a)candelatech.com>
> > > [ 726.601384] All bugs added by David S. Miller <davem(a)redhat.com>
> > > [ 726.646757] BUG: unable to handle kernel NULL pointer dereference at 00000000
> > > [ 726.732410] IP: [<00000000>]
> > > [ 726.766933] *pdpt = 0000000000431001 *pde = 0000000000000000
> > > [ 726.766937] Oops: 0010 [#1] SMP
> > > [ 726.790844] Modules linked in: 8021q iptable_filter ip_tables
> > > x_tables ip_gre af_packet i2c_dev i2c_qs i2c_algo_bit i2c_core garp
> > > stp llc ixgbe inet_lro psmouse serio_raw intel_agp shpchp iTCO_wdt
> > > pci_hotplug iTCO_vendor_support agpgart ext3 jbd mbcache sd_mod
> > > crc_t10dif sg ata_piix ata_generic ahci libata scsi_mod ehci_hcd
> > > uhci_hcd usbcore [last unloaded: 8021q]
> > > [ 726.790844]
> > > [ 726.790844] Pid: 4, comm: ksoftirqd/0 Tainted: P (2.6.27)
> > > [ 726.790844] EIP: 0060:[<00000000>] EFLAGS: 00010202 CPU: 0
> > > [ 726.790844] EIP is at 0x0
> > > [ 726.790844] EAX: e7f4c498 EBX: 00000000 ECX: 77470000 EDX: e7f4c498
> > > [ 726.790844] ESI: 4bd1d300 EDI: 00000007 EBP: f784df88 ESP: f784df78
> > > [ 726.790844] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> > > [ 726.790844] Process ksoftirqd/0 (pid: 4, ti=f784c000 task=f783a5b0 task.ti=f784c000)
> > > [ 726.790844] Stack: 40168080 00000001 403daaa0 4042c500 f784df90 401681bf f784dfb0 4012fe92
> > > [ 726.790844] 0000000a 00000000 40429340 00000246 00000000 40130120 f784dfbc 4012ff55
> > > [ 726.790844] 4042c500 f784dfcc 40130182 fffffffc 00000000 f784dfe0 4013e707 4013e6c0
> > > [ 726.790844] Call Trace:
> > > [ 726.790844] [<40168080>] ? __rcu_process_callbacks+0x70/0x190
> > > [ 726.790844] [<401681bf>] ? rcu_process_callbacks+0x1f/0x40
> > > [ 726.790844] [<4012fe92>] ? __do_softirq+0x82/0x100
> > > [ 726.790844] [<40130120>] ? ksoftirqd+0x0/0xe0
> > > [ 726.790844] [<4012ff55>] ? do_softirq+0x45/0x50
> > > [ 726.790844] [<40130182>] ? ksoftirqd+0x62/0xe0
> > > [ 726.790844] [<4013e707>] ? kthread+0x47/0x80
> > > [ 726.790844] [<4013e6c0>] ? kthread+0x0/0x80
> > > [ 726.790844] [<4010494f>] ? kernel_thread_helper+0x7/0x10
> > > [ 726.790844] =======================
> > > [ 726.790844] Code: Bad EIP value.
> > > [ 726.790844] EIP: [<00000000>] 0x0 SS:ESP 0068:f784df78
> > >
> > > So now I can't figure out the piece of code where this dereferencing
> > > occurred. :(
> >
> > Yeah, I don't know why the exception frame wasn't displayed in the
> > bt output below, but I think it may have been confusion due to the kernel
> > text region starting at 0x40000000 (instead of the typical 3G/1G user/kernel
> > virtual address split). I'm guessing your kernel is configured as 1G/3G
> > user/kernel?
>
> That's right, the kernel is configured as 1G/3G user/kernel.
>
> > (I've never seen that before...)
>
> It's a weird config indeed. I'll try rewriting some stuff so it
> consumes way less memory so a normal kernel/user split can be used.
> Nevertheless, why the pointer became null remains unsolved for the moment. :-)
> Would the user/kernel split also be an issue in 64 bit?
I wouldn't expect you'd ever need to modify the user-kernel split in x86_64,
if that's what you're asking? The 64-bit virtual address range is so vast
that it's hard to conceive of a need to do anything like that.
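For the 32-bit case, the effect of the split can be sketched roughly as below. This is an illustration only: page_offset_for_split and is_kernel_address are hypothetical helper names, and the model assumes the kernel region simply starts where the user region ends (what the kernel calls PAGE_OFFSET).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Start of kernel virtual addresses on a 32-bit system whose lower
 * `user_gib` GiB belong to user space (1, 2 or 3).
 */
uint32_t page_offset_for_split(unsigned user_gib)
{
    assert(user_gib >= 1 && user_gib <= 3);
    return (uint32_t)user_gib << 30;
}

/* An address is a kernel address if it lies at or above the split point. */
bool is_kernel_address(uint32_t addr, uint32_t page_offset)
{
    return addr >= page_offset;
}
```

Under this model the typical 3G/1G split puts the boundary at 0xC0000000, while a 1G/3G split puts it at 0x40000000. An address such as 40168080 from the call trace is then a kernel address under 1G/3G but would be classified as a user address by a tool that assumes 3G/1G.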
Dave
Re: [Crash-utility] crash: invalid structure member offset
by Dave Anderson
----- "Reinoud Koornstra" <koornstra(a)hp.com> wrote:
> Thanks,
>
> Using crash 5.0.6 worked nicely.
> However, I can't really look at a lot because of a bad EIP code.
>
> [ 726.601381] 802.1Q VLAN Support v1.8 Ben Greear <greearb(a)candelatech.com>
> [ 726.601384] All bugs added by David S. Miller <davem(a)redhat.com>
> [ 726.646757] BUG: unable to handle kernel NULL pointer dereference at 00000000
> [ 726.732410] IP: [<00000000>]
> [ 726.766933] *pdpt = 0000000000431001 *pde = 0000000000000000
> [ 726.766937] Oops: 0010 [#1] SMP
> [ 726.790844] Modules linked in: 8021q iptable_filter ip_tables
> x_tables ip_gre af_packet i2c_dev i2c_qs i2c_algo_bit i2c_core garp
> stp llc ixgbe inet_lro psmouse serio_raw intel_agp shpchp iTCO_wdt
> pci_hotplug iTCO_vendor_support agpgart ext3 jbd mbcache sd_mod
> crc_t10dif sg ata_piix ata_generic ahci libata scsi_mod ehci_hcd
> uhci_hcd usbcore [last unloaded: 8021q]
> [ 726.790844]
> [ 726.790844] Pid: 4, comm: ksoftirqd/0 Tainted: P (2.6.27)
> [ 726.790844] EIP: 0060:[<00000000>] EFLAGS: 00010202 CPU: 0
> [ 726.790844] EIP is at 0x0
> [ 726.790844] EAX: e7f4c498 EBX: 00000000 ECX: 77470000 EDX: e7f4c498
> [ 726.790844] ESI: 4bd1d300 EDI: 00000007 EBP: f784df88 ESP: f784df78
> [ 726.790844] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> [ 726.790844] Process ksoftirqd/0 (pid: 4, ti=f784c000 task=f783a5b0 task.ti=f784c000)
> [ 726.790844] Stack: 40168080 00000001 403daaa0 4042c500 f784df90 401681bf f784dfb0 4012fe92
> [ 726.790844] 0000000a 00000000 40429340 00000246 00000000 40130120 f784dfbc 4012ff55
> [ 726.790844] 4042c500 f784dfcc 40130182 fffffffc 00000000 f784dfe0 4013e707 4013e6c0
> [ 726.790844] Call Trace:
> [ 726.790844] [<40168080>] ? __rcu_process_callbacks+0x70/0x190
> [ 726.790844] [<401681bf>] ? rcu_process_callbacks+0x1f/0x40
> [ 726.790844] [<4012fe92>] ? __do_softirq+0x82/0x100
> [ 726.790844] [<40130120>] ? ksoftirqd+0x0/0xe0
> [ 726.790844] [<4012ff55>] ? do_softirq+0x45/0x50
> [ 726.790844] [<40130182>] ? ksoftirqd+0x62/0xe0
> [ 726.790844] [<4013e707>] ? kthread+0x47/0x80
> [ 726.790844] [<4013e6c0>] ? kthread+0x0/0x80
> [ 726.790844] [<4010494f>] ? kernel_thread_helper+0x7/0x10
> [ 726.790844] =======================
> [ 726.790844] Code: Bad EIP value.
> [ 726.790844] EIP: [<00000000>] 0x0 SS:ESP 0068:f784df78
>
> So now I can't figure out the piece of code where this dereferencing
> occurred. :(
Yeah, I don't know why the exception frame wasn't displayed in the
bt output below, but I think it may have been confusion due to the kernel
text region starting at 0x40000000 (instead of the typical 3G/1G user/kernel
virtual address split). I'm guessing your kernel is configured as 1G/3G
user/kernel? (I've never seen that before...)
Anyway, somehow the EIP got zeroed out, and it took a fault trying
to handle that. That can happen if a kernel function corrupts its
own stack by incorrectly writing to its own local stack variables,
and in so doing writes a zero into the return address saved on the
stack. Then when the function returns, that zero is loaded into the
EIP, and you'd see something like the above.
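As a toy model of that failure mode (not taken from the kernel in question), the sketch below represents one stack frame as a struct and "clears" more words than the function has locals. The layout, the names frame_model, clear_locals and eip_after_return, and the number of locals are all assumptions made for illustration. The overrun is confined to the model so the example itself has no undefined behavior.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_LOCALS = 4 };

/* Simplified model of one 32-bit x86 stack frame, lowest address first. */
struct frame_model {
    uint32_t locals[NUM_LOCALS]; /* the function's local variables */
    uint32_t saved_ebp;          /* caller's frame pointer */
    uint32_t return_addr;        /* loaded into EIP when the function returns */
};

/*
 * Model of a buggy function that zeroes `count` words starting at its
 * first local. The frame is copied into a flat word array so that an
 * overrun past the locals stays inside the model.
 */
void clear_locals(struct frame_model *f, size_t count)
{
    uint32_t words[sizeof *f / sizeof(uint32_t)];

    assert(count <= sizeof words / sizeof words[0]);
    memcpy(words, f, sizeof words);
    for (size_t i = 0; i < count; i++)
        words[i] = 0;
    memcpy(f, words, sizeof words);
}

/* The value EIP would take when the modelled function returns. */
uint32_t eip_after_return(const struct frame_model *f)
{
    return f->return_addr;
}
```

Clearing exactly NUM_LOCALS words leaves the return address intact. Clearing NUM_LOCALS + 2 words also zeroes the saved frame pointer and the return address, so the modelled function would "return" to address 0.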
The exception frame in the log shows that the ESP is f784df78,
and looking at the trace data below, it looks like rcu_process_callbacks()
may have ended up calling something that led to the EIP corruption.
Just a guess though...
Dave
>
> crash> bt
> PID: 4 TASK: f783a5b0 CPU: 0 COMMAND: "ksoftirqd/0"
> #0 [f784de88] crash_kexec at 401534a8
> #1 [f784df28] __slab_free at 4019677f
> #2 [f784df8c] rcu_process_callbacks at 401681ba
> #3 [f784df94] __do_softirq at 4012fe90
> #4 [f784dfb4] do_softirq at 4012ff50
> #5 [f784dfd0] kthread at 4013e705
> #6 [f784dfe4] kernel_thread_helper at 4010494d
>
> Thanks,
>
> Reinoud.
>
>
> > -----Original Message-----
> > From: crash-utility-bounces(a)redhat.com [mailto:crash-utility-
> > bounces(a)redhat.com] On Behalf Of Dave Anderson
> > Sent: Thursday, August 12, 2010 6:14 AM
> > To: Discussion list for crash utility usage, maintenance and
> > development
> > Subject: Re: [Crash-utility] crash: invalid structure member offset
> >
> >
> > ----- "Reinoud Koornstra" <koornstra(a)hp.com> wrote:
> >
> > > Hi Everyone,
> > >
> > > I am trying to read a core file into crash, but I've got bad luck as
> > > you can see below.
> > > Is core file corrupt? It is a vmcore file from a 32 bits kernel that
> > > was compiled with PAE, could that have corrupted things?
> > > Any hints here?
> > > Thanks,
> > >
> > > Reinoud.
> > >
> > > $ crash System.map-2.6.27 ./vmlinux-2.6.27 ./vmcore
> > >
> > > crash 4.0-3.7
> >
> > I don't know if the vmcore is corrupt, but PAE wouldn't be an issue.
> >
> > However, you are running a version of crash that was released almost
> > 4 years ago (13-Oct-2006) against a two-year-old kernel that was
> > released 15-Oct-2008. That's pretty much a guarantee of failure.
> >
> > Try updating to version 5.0.6 and see what happens.
> >
> > And BTW, if the vmlinux file is the exact same kernel as the
> > one that generated the vmcore file, you don't need a System.map
> > argument.
> >
> > Dave
> >
> >
> >
> > 15-Oct-2008
> >
> > > Copyright 2002, 2003, 2004, 2005, 2006 Red Hat, Inc.
> > > Copyright 2004, 2005, 2006 IBM Corporation
> > > Copyright 1999-2006 Hewlett-Packard Co
> > > Copyright 2005 Fujitsu Limited
> > > Copyright 2005 NEC Corporation
> > > Copyright 1999, 2002 Silicon Graphics, Inc.
> > > Copyright 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> > > This program is free software, covered by the GNU General Public License,
> > > and you are welcome to change it and/or distribute copies of it under
> > > certain conditions. Enter "help copying" to see the conditions.
> > > This program has absolutely no warranty. Enter "help warranty" for
> > > details.
> > >
> > > GNU gdb 6.1
> > > Copyright 2004 Free Software Foundation, Inc.
> > > GDB is free software, covered by the GNU General Public License, and you are
> > > welcome to change it and/or distribute copies of it under certain conditions.
> > > Type "show copying" to see the conditions.
> > > There is absolutely no warranty for GDB. Type "show warranty" for details.
> > > This GDB was configured as "i686-pc-linux-gnu"...
> > >
> > > please wait... (gathering kmem slab cache data)
> > >
> > > crash: invalid structure member offset: kmem_cache_s_c_num
> > > FILE: memory.c LINE: 6891 FUNCTION: kmem_cache_init()
> > >
> > > [/usr/bin/crash] error trace: 80827a9 => 8095398 => 80aa7ef =>
> > > 8131e88
> > > /usr/bin/nm: /usr/bin/crash: no symbols
> > > /usr/bin/nm: /usr/bin/crash: no symbols
> > > /usr/bin/nm: /usr/bin/crash: no symbols
> > > /usr/bin/nm: /usr/bin/crash: no symbols
> > >
> > > WARNING: Because this kernel was compiled with gcc version 4.1.2, certain
> > > commands or command options may fail unless crash is invoked with
> > > the "--readnow" command line option.
> >
> > --
> > Crash-utility mailing list
> > Crash-utility(a)redhat.com
> > https://www.redhat.com/mailman/listinfo/crash-utility
>
Re: [Crash-utility] crash: invalid structure member offset
by Dave Anderson
----- "Reinoud Koornstra" <koornstra(a)hp.com> wrote:
> Hi Everyone,
>
> I am trying to read a core file into crash, but I've got bad luck as you can see below.
> Is core file corrupt? It is a vmcore file from a 32 bits kernel that
> was compiled with PAE, could that have corrupted things?
> Any hints here?
> Thanks,
>
> Reinoud.
>
> $ crash System.map-2.6.27 ./vmlinux-2.6.27 ./vmcore
>
> crash 4.0-3.7
I don't know if the vmcore is corrupt, but PAE wouldn't be an issue.
However, you are running a version of crash that was released almost
4 years ago (13-Oct-2006) against a two-year-old kernel that was
released 15-Oct-2008. That's pretty much a guarantee of failure.
Try updating to version 5.0.6 and see what happens.
And BTW, if the vmlinux file is the exact same kernel as the
one that generated the vmcore file, you don't need a System.map
argument.
Dave
15-Oct-2008
> Copyright 2002, 2003, 2004, 2005, 2006 Red Hat, Inc.
> Copyright 2004, 2005, 2006 IBM Corporation
> Copyright 1999-2006 Hewlett-Packard Co
> Copyright 2005 Fujitsu Limited
> Copyright 2005 NEC Corporation
> Copyright 1999, 2002 Silicon Graphics, Inc.
> Copyright 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public License,
> and you are welcome to change it and/or distribute copies of it under
> certain conditions. Enter "help copying" to see the conditions.
> This program has absolutely no warranty. Enter "help warranty" for
> details.
>
> GNU gdb 6.1
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "i686-pc-linux-gnu"...
>
> please wait... (gathering kmem slab cache data)
>
> crash: invalid structure member offset: kmem_cache_s_c_num
> FILE: memory.c LINE: 6891 FUNCTION: kmem_cache_init()
>
> [/usr/bin/crash] error trace: 80827a9 => 8095398 => 80aa7ef =>
> 8131e88
> /usr/bin/nm: /usr/bin/crash: no symbols
> /usr/bin/nm: /usr/bin/crash: no symbols
> /usr/bin/nm: /usr/bin/crash: no symbols
> /usr/bin/nm: /usr/bin/crash: no symbols
>
> WARNING: Because this kernel was compiled with gcc version 4.1.2, certain
> commands or command options may fail unless crash is invoked with
> the "--readnow" command line option.
Re: [Crash-utility] [RFC] gcore subcommand: a process coredump feature
by anderson@prospeed.net
>
> Hello,
>
> For some weeks I've developed gcore subcommand for crash utility which
> provides process coredump feature for crash kernel dump, strongly
> demanded by users who want to investigate user-space applications
> contained in kernel crash dump.
>
> I've now finished making a prototype version of gcore and found out
> what are the issues to be addressed intensely. Could you give me any
> comments and suggestions on this work?
Hello Daisuke,
As I mentioned in my previous email re: cpu numbering, I am currently
on vacation, and cannot spend much time looking at this issue until
I get back on August 9th.
However, I think that this could be a useful feature, and I did
take a quick look at how it could be done several months ago when
it was brought up on this mailing list. As you discovered, the
user-space core dump code in the kernel has undergone significant
changes over time, so the implementation in the crash utility would
have to adapt to the kernel data structures used by the various
kernel versions. Because of that, I don't want to put it into the
base crash binary; rather, it should be maintained as one or more
extension modules, which can be located in the "extensions"
subdirectory of the crash source package, as well as stored on the
"extensions" web page linked from the crash "people" web site.
It is quite simple to re-adapt your patch as an extension module.
Check the "snap.c" and "snap.mk" files in the extensions subdirectory
as templates for your "gcore" command.
As to the other questions below, I will get back to you after
August 9th.
Thanks,
Dave
> Motivation
> ==========
>
> It's a relatively familiar technique that in a cluster system a
> currently running node triggers the crash kernel dump mechanism when
> detecting a critical error, so that the running, error-detecting
> server stops as soon as possible. Consequently, the resulting crash
> kernel dump contains a process image of the erroneous user
> application. In that case, developers are interested in user
> space, rather than kernel space.
>
> There's also a merit of gcore that it allows us to use several
> userland debugging tools, such as GDB and binutils, in order to
> analyze user space memory.
>
>
> Current Status
> ==============
>
> I confirm the prototype version runs on the following configuration:
>
> Linux Kernel Version: 2.6.34
> Supporting Architecture: x86_64
> Crash Version: 5.0.5
> Dump Format: ELF
>
> I'm planning to widen a range of support as follows:
>
> Linux Kernel Version: Any
> Supporting Architecture: i386, x86_64 and IA64
> Dump Format: Any
>
>
> Issues
> ======
>
> Currently, I have issues below.
>
> 1) Retrieval of appropriate register values
>
> The prototype version retrieves register values from a _wrong_
> location: a top of the kernel stack, into which register values are
> saved at any preemption context switch. On the other hand, the
> register values that should be included here are the ones saved at
> user-to-kernel context switch on any interrupt event.
>
> I've yet to implement this. Specifically, I need to do the following
> task from now.
>
> (1) list all entries from user-space to kernel-space execution path.
>
> (2) divide the entries according to where and how the register
> values from user-space context are saved.
>
> (3) compose a program that retrieves the saved register values from
> appropriate locations that is traced by means of (1) and (2).
>
> Ideally, I think it's best if crash library provides any means of
> retrieving this kind of register values, that is, ones saved on
> various stack frames. Is there such a plan to do?
>
>
> 2) Getting a signal number for a task which was during core dump
> process at kernel crash
>
> If a target task is halfway of core dump process, it's better to know
> a signal number in order to know why the task was about to be core
> dumped.
>
> Unfortunately, I have no choice but backtrace the kernel stack to
> retrieve a signal number saved there as an argument of, for example,
> do_coredump().
>
>
> 3) Kernel version compatibility
>
> crash's policy is to support all kernel versions by the latest crash
> package. On the other hand, the prototype is based on kernel 2.6.34.
> This means more kernel versions need to be supported.
>
> Well, the question is: to what versions do I need to really test in
> addition to the latest upstream kernel? I think it's practically
> enough to support RHEL4, RHEL5 and RHEL6.
>
>
> Build Instruction
> =================
>
> $ tar xf crash-5.0.5.tar.gz
> $ cd crash-5.0.5/
> $ patch -p 1 < gcore.patch
> $ make
>
>
> Usage
> =====
>
> Use help subcommand of crash utility as ``help gcore''.
>
>
> Attached File
> =============
>
> * gcore.patch
>
> A patch implementing gcore subcommand for crash-5.0.5.
>
> The diffstat output is as follows.
>
> $ diffstat gcore.patch
> Makefile | 10 +-
> defs.h | 15 +
> gcore.c | 1858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> gcore.h | 639 ++++++++++++++++++++
> global_data.c | 3 +
> help.c | 28 +
> netdump.c | 27 +
> tools.c | 37 ++
> 8 files changed, 2615 insertions(+), 2 deletions(-)
>
> --
> HATAYAMA Daisuke
> d.hatayama(a)jp.fujitsu.com
>
[crash-utility] [lkcd-devel] Patch to add LKCD vmcore validation feature
by Vitaly Kuzmichev
Hello,
Attached is a patch that adds a separate tool for validating LKCD netdumps
and blockdumps.
We are planning to add this feature to our fork of crash-3.10.
Our customers requested this feature, and we have found that crash
does not print any warnings when someone tries to load an incomplete
vmcore. They need a simple way to verify that a core file generated by
LKCD is complete.
--
Best regards,
Vitaly Kuzmichev, Software Engineer,
MontaVista Software, LLC.
Re: [Crash-utility] crash fails to start with RHEL4/ia64 vmcore
by Dave Anderson
----- "Mark Goodwin" <mgoodwin(a)redhat.com> wrote:
> [re-send now that I'm subscribed]
>
> Any help would be appreciated - I have an IA64 vmcore
> from a RHEL4 system, and crash doesn't like it :
>
> # file vmcore
> vmcore: ELF 64-bit LSB core file IA-64, version 1 (SYSV), SVR4-style, from 'vmlinux'
>
> # crash usr__2.6.9-89.EL/lib/debug/lib/modules/2.6.9-89.EL/vmlinux vmcore
>
> crash 5.0.6
> .. copyright stuff omitted ..
> GNU gdb (GDB) 7.0
> ...
> WARNING: invalid linux_banner pointer: 8a653075c6c36178
> crash: usr__2.6.9-89.EL/lib/debug/lib/modules/2.6.9-89.EL/vmlinux and vmcore do not match!
> Usage: ...
>
> I've tried hacking the 5.0.6 src to avoid the linux_banner check,
> but it just breaks further down the track :
>
> WARNING: boot command line memory limit: 59ab0c85d99b2fa9
> crash: ia64_VTOP(b929ee8856a6b891): unexpected region 5 address
>
> So I'm stuck. The customer assures me the vmcore is from their 2.6.9-89.EL
> ia64 system. At the very least it would help if I could extract some or all
> of the dmesg buffer.
It appears that there's simply a mismatch between the vmlinux and vmcore file.
If it's not a compressed diskdump vmcore, you can verify that easily enough by
just doing:
$ strings vmlinux | grep "Linux version"
$ strings vmcore | grep "Linux version"
and comparing the output.
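If strings and grep are not at hand, the same check can be approximated with a small program. The sketch below is only an assumption-laden stand-in for the pipeline above: find_linux_banner is a hypothetical helper, it reports only the first match, and like the strings approach it cannot see into a compressed dump.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the first "Linux version ..." banner found in `path` into `out`
 * (printable characters only, NUL-terminated).
 * Returns 1 if found, 0 if not found, -1 if the file cannot be opened.
 */
int find_linux_banner(const char *path, char *out, size_t outlen)
{
    static const char key[] = "Linux version";
    const size_t keylen = sizeof key - 1;
    size_t matched = 0;
    int c;
    FILE *fp;

    assert(out != NULL && outlen > keylen);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    while ((c = fgetc(fp)) != EOF) {
        /*
         * 'L' occurs only at the start of the key, so after a mismatch
         * the only possible partial match is a fresh 'L'.
         */
        if (c == key[matched])
            matched++;
        else
            matched = (c == key[0]) ? 1 : 0;

        if (matched == keylen) {
            size_t n = keylen;

            memcpy(out, key, keylen);
            /* Append the rest of the banner up to the first non-printable byte. */
            while (n < outlen - 1 && (c = fgetc(fp)) != EOF &&
                   c >= 0x20 && c < 0x7f)
                out[n++] = (char)c;
            out[n] = '\0';
            fclose(fp);
            return 1;
        }
    }
    fclose(fp);
    return 0;
}
```

Calling it once on the vmlinux file and once on the vmcore file, then comparing the two strings with strcmp(), gives the same yes/no answer as comparing the grep output by eye.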
> Dave, the vmcore is on an internal Red Hat RHEL4/ia64 system - I can
> give you access details off-list if you think it will help.
Sure -- if the strings output doesn't give you a clue, please point me to the files.
Thanks,
Dave
crash fails to start with RHEL4/ia64 vmcore
by Mark Goodwin
[re-send now that I'm subscribed]
Any help would be appreciated - I have an IA64 vmcore
from a RHEL4 system, and crash doesn't like it :
# file vmcore
vmcore: ELF 64-bit LSB core file IA-64, version 1 (SYSV), SVR4-style, from 'vmlinux'
# crash usr__2.6.9-89.EL/lib/debug/lib/modules/2.6.9-89.EL/vmlinux vmcore
crash 5.0.6
.. copyright stuff omitted ..
GNU gdb (GDB) 7.0
...
WARNING: invalid linux_banner pointer: 8a653075c6c36178
crash: usr__2.6.9-89.EL/lib/debug/lib/modules/2.6.9-89.EL/vmlinux and vmcore do not match!
Usage: ...
I've tried hacking the 5.0.6 src to avoid the linux_banner check,
but it just breaks further down the track :
WARNING: boot command line memory limit: 59ab0c85d99b2fa9
crash: ia64_VTOP(b929ee8856a6b891): unexpected region 5 address
So I'm stuck. The customer assures me the vmcore is from their 2.6.9-89.EL
ia64 system. At the very least it would help if I could extract some or all
of the dmesg buffer.
Dave, the vmcore is on an internal Red Hat RHEL4/ia64 system - I can
give you access details off-list if you think it will help.
Thanks
-- Mark Goodwin
Re: [Crash-utility] Question on online/present/possible CPUS
by anderson@prospeed.net
>
> Hi all,
>
> before making a larger cleanup, I want to ask here for your opinion. It
> seems that there is quite a bit of confusion about the meaning of CPU
> count printed out by the crash utility.
>
> 1. Number of CPUs
>
> Some people think that crash should always output the number of CPUs in
> the system (i.e. a quad-core server should always output 'CPUS: 4'),
> while other people think that only online CPUs should be counted.
>
> 2. CPU numbering
>
> For example, if there are 4 CPUs in the system, but some of them are
> taken offline (e.g. CPU 1 and CPU 3), _and_ crash outputs the number of
> online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
> that valid CPU numbers are 0 and 2 in this case.
Hi Petr,
For all but ppc64, the number shown by the initial banner and the
"sys" command is essentially "the-highest-cpu-number-plus-one".
For ppc64 (as requested and implemented by the IBM/ppc64 maintainers),
it shows the number of online cpus. There are reasons for doing it
either way, but I'm on vacation now, so please research the list
archives for the various arguments for and against each approach.
Check changelog.html for when it was changed for ppc64, and then
cross-reference the revision date with the list archives.
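To make the difference between the two conventions concrete, here is a small sketch over a bitmask with one bit per cpu. It is not how crash computes the number. The helper names and the 64-bit mask are assumptions, and which kernel cpu mask would be fed in is deliberately left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MAX_CPUS = 64 }; /* assumption: one bit per cpu in a 64-bit mask */

/* True if bit `cpu` is set in the mask, i.e. that cpu number is in use. */
bool cpu_in_mask(uint64_t mask, unsigned cpu)
{
    assert(cpu < MAX_CPUS);
    return (mask >> cpu) & 1u;
}

/* Convention 1: number of cpus in the mask (count of set bits). */
unsigned cpus_in_mask(uint64_t mask)
{
    unsigned n = 0;

    for (; mask != 0; mask &= mask - 1) /* clears the lowest set bit */
        n++;
    return n;
}

/* Convention 2: highest cpu number in the mask, plus one. */
unsigned highest_cpu_plus_one(uint64_t mask)
{
    unsigned n = 0;

    for (; mask != 0; mask >>= 1)
        n++;
    return n;
}
```

For your example with cpus 0 and 2 online (mask 0x5), the first convention gives 2 and the second gives 3, and only cpu_in_mask() shows that cpu 1 is the missing one.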
> 3. Examining offline CPU
>
> Sometimes, it may be useful to examine the state of an offline CPU. Now,
> I know that the saved state is most likely stale, but it can be useful
> in some cases (e.g. a crash after dropping to kdb). The crash utility
> currently refuses to select an offline CPU with 'set -c #'. Are there
> any concerns about allowing it?
I tend to agree with you, but the only thing that's useful and
available from an offline cpu is the swapper task for that cpu
and the runqueue for that cpu. And both of those entities are
readily accessible if you really need them. Although I don't know
anything about kdb status, so maybe there's something of per-cpu
interest, but I don't know why it would be necessary to "set"
that cpu?
In any case, like I said before, I'm just temporarily online while
on vacation, and will be back to work on the 9th.
Thanks,
Dave
Question on online/present/possible CPUs
by Petr Tesarik
Hi all,
before making a larger cleanup, I want to ask here for your opinion. It
seems that there is quite a bit of confusion about the meaning of CPU
count printed out by the crash utility.
1. Number of CPUs
Some people think that crash should always output the number of CPUs in
the system (i.e. a quad-core server should always output 'CPUS: 4'),
while other people think that only online CPUs should be counted.
2. CPU numbering
For example, if there are 4 CPUs in the system, but some of them are
taken offline (e.g. CPU 1 and CPU 3), _and_ crash outputs the number of
online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
that valid CPU numbers are 0 and 2 in this case.
3. Examining offline CPU
Sometimes, it may be useful to examine the state of an offline CPU. Now,
I know that the saved state is most likely stale, but it can be useful
in some cases (e.g. a crash after dropping to kdb). The crash utility
currently refuses to select an offline CPU with 'set -c #'. Are there
any concerns about allowing it?
Regards,
Petr Tesarik