Re: [Crash-utility] Crash faults when determining panic task

Thursday, 29 September 2011

----- Original Message -----
...
 Dave,

 Adding --no_elf_notes to the crash invocation does indeed start crash
 without issue.  Do you think that I am dealing with a
 corrupted/incomplete vmcore (as evidenced by that extremely large n_descsz
 value) or is this a bug that crash could more gracefully handle?

Hi Joe,

It should absolutely handle it more gracefully, but I'm not sure whether
the vmcore is corrupt.  It's difficult to debug this from afar, but hopefully
you can help me out a little bit.  (I'm also cc'ing the authors of this code
directly, to see if they can shed a little light on the matter.)

W/respect to your first patch that checks for a non-NULL bt->machdep
in x86_64_get_dumpfile_stack_frame(), can you tell me how that happened
exactly?  Before that function was called, it should have come through
get_netdump_regs_x86_64(), which would -- or would not -- have set
bt->machdep here:

        if (ELF_NOTES_VALID() &&
            (bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
            (note = (Elf64_Nhdr *)
             diskdump_get_prstatus_percpu(bt->tc->processor))) {
                user_regs = get_regs_from_note((char *)note, &rip, &rsp);

                if (CRASHDEBUG(1))
                        netdump_print("ELF prstatus rsp: %lx rip: %lx\n",
                                rsp, rip);

                *rspp = rsp;
                *ripp = rip;

                if (*ripp && *rspp)
                        bt->flags |= BT_KDUMP_ELF_REGS;

                bt->machdep = (void *)user_regs;
        }

If it did *not* set bt->machdep above, then it must have been because
diskdump_get_prstatus_percpu() below returned a NULL pointer?

void *
diskdump_get_prstatus_percpu(int cpu)
{
        return dd->nt_prstatus_percpu[cpu];
}
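
As a rough sketch only (this is not the actual crash code), a more
defensive variant of that lookup could refuse an out-of-range "cpu"
instead of indexing past the end of the array.  The struct name
dd_sketch, the helper name prstatus_percpu_checked(), and the idea that
a note count is available next to the array are all assumptions made
for illustration:

```c
#include <stddef.h>

/* Hypothetical stand-in for the diskdump data referenced above. */
struct dd_sketch {
        int num_prstatus_notes;     /* assumed: number of array entries */
        void **nt_prstatus_percpu;  /* per-cpu NT_PRSTATUS note pointers */
};

/*
 * Return the note pointer recorded for "cpu", or NULL if the index is
 * out of range or the array itself is missing.  A NULL entry in the
 * array is passed through unchanged, so the caller still has to check.
 */
static void *
prstatus_percpu_checked(const struct dd_sketch *dd, int cpu)
{
        if (dd == NULL || dd->nt_prstatus_percpu == NULL)
                return NULL;
        if (cpu < 0 || cpu >= dd->num_prstatus_notes)
                return NULL;
        return dd->nt_prstatus_percpu[cpu];
}
```

With a lookup like this, a NULL return covers both "no note for that
cpu" and "cpu index not valid", which is the case the caller quoted
above already tolerates by leaving bt->machdep unset.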

If you bring up crash with at least debug level 1 like this:

 $ crash -d1 vmlinux vmcore

you will see a dump of the array of dd->nt_prstatus_percpu[] note
pointers.  Alternatively during run-time, you can see the same output
by entering "help -n". 

Can you confirm:

  (1) what "cpu" value was passed to the function (presumably it was
      legitimate), and 
  (2) whether dd->nt_prstatus_percpu[cpu] was NULL?

Secondly, w/respect to the bogus note->n_descsz value, was the note
pointer containing it one of those listed in the dd->nt_prstatus_percpu[]
array?  If not, what was the "cpu" value passed to
diskdump_get_prstatus_percpu() that time?
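
For illustration, one way a bogus n_descsz could be rejected before any
registers are read from the note is a simple size check against the
buffer the note lives in.  This is a sketch under assumptions, not what
crash does: struct nhdr_sketch mirrors the three 32-bit fields of
Elf64_Nhdr so the example stands alone, and note_fits(), "avail" (bytes
remaining in the note buffer) and "min_desc" (smallest descriptor the
caller is willing to parse) are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the Elf64_Nhdr layout: three 32-bit words. */
struct nhdr_sketch {
        uint32_t n_namesz;
        uint32_t n_descsz;
        uint32_t n_type;
};

/* ELF note name and descriptor fields are padded to 4-byte boundaries. */
static uint64_t
note_roundup4(uint64_t n)
{
        return (n + 3) & ~(uint64_t)3;
}

/*
 * Return 1 if the header, padded name and padded descriptor all fit in
 * "avail" bytes and the descriptor holds at least "min_desc" bytes;
 * return 0 otherwise.  The sum is done in 64 bits so that a huge
 * 32-bit n_descsz cannot wrap around.
 */
static int
note_fits(const struct nhdr_sketch *note, size_t avail, size_t min_desc)
{
        uint64_t total;

        if (note == NULL || avail < sizeof(*note))
                return 0;
        total = sizeof(*note) + note_roundup4(note->n_namesz) +
                note_roundup4(note->n_descsz);
        if (total > avail)
                return 0;
        return note->n_descsz >= min_desc;
}
```

A header with n_namesz = 8 and n_descsz = 3438804992, as in the gdb
output quoted further down, would fail this check for any note buffer
of realistic size.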

And also, what is the output of:

  crash> help -k | grep _map:

On my workstation, I see this:

  crash> help -k | grep _map:
         cpu_possible_map: 0 1 2 3 4 5 6 7 
          cpu_present_map: 0 1 2 3 4 5 6 7 
           cpu_online_map: 0 1 2 3 4 5 6 7 
  crash>

I'm wondering if your dump shows a system with some of the lower
cpus taken offline?

Thanks,
  Dave

...

 As far as the kernel is concerned,
 2.6.32-131.0.15.el6.exp10.bz16586.x86_64 was a stock RH 2.6.32-131.0.15
 with an added patch for handling an MD Raid bug (RHBZ-707268).  Stratus
 does load a driver to track dirty VM pages for harvesting purposes, but
 does not change general VM behavior.

 FWIW, this is the only vmcore in which I've seen ELF note faults or
 invalid section numbers.

 Thanks,

 -- Joe

 -----Original Message-----
 From: crash-utility-bounces(a)redhat.com
 [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
 Sent: Wednesday, September 28, 2011 5:15 PM
 To: Discussion list for crash utility usage,maintenance and
 development
 Subject: Re: [Crash-utility] Crash faults when determining panic task

 Hi Joe,

 It's pretty clear it's due to this change in 5.1.5:

          - Implemented the capability of using the NT_PRSTATUS ELF note data
            that is saved in version 4 compressed kdump headers to determine the
            starting stack and instruction pointer hooks for x86 and x86_64
            backtraces when they cannot be determined in the traditional manners.
            (wang.chao(a)cn.fujitsu.com, wency(a)cn.fujitsu.com)

 What happens if you run it like so:

   $ crash --no_elf_notes vmlinux vmcore

 As far as this message:

   WARNING: sparsemem: invalid section number: 137438888923

 That should be outside the realm of Fujitsu's ELF notes patch.  Does this kernel
 have some kind of Stratus VM modification?

 Dave

 ----- Original Message -----
 > 
 > Crash faults when determining panic task
 > 
 > I have a vmcore generated on RHEL6.1 that newer versions of crash
 > have trouble analyzing (5.1.1-2.el6 seems to work OK).
 > 
 > 
 > 
 > I can provide additional binary files if needed, just let me know
 > what convention best suits the list (ftp, private email attachment,
 > etc.)
 > 
 > 
 > 
 > Crash Version        OS               Result
 > crash 5.1.8          Debian wheezy    faults
 > crash 5.1.7-1.el6    RHEL6.2 Alpha    faults
 > crash 5.1.1-2.el6    RHEL6.1          ok
 > 
 > 
 > Kernel:
 > 
 > 2.6.32-131.0.15.el6.exp10.bz16586.x86_64 ( 2.6.32-131.0.15 + a fix
 > for Red Hat bz - 707268)
 > 
 > 
 > Interesting warnings when starting crash:
 > 
 > WARNING: sparsemem: invalid section number: 137438888923
 > 
 > WARNING: sparsemem: invalid section number: 137438888923
 > 
 > 
 > First fault, null pointer dereference:
 > 
 > please wait... (determining panic task)
 > Program received signal SIGSEGV, Segmentation fault.
 > x86_64_get_dumpfile_stack_frame (rsp=0x7fffffffcc58, rip=0x7fffffffcc50,
 >     bt_in=0x7fffffffcce0) at x86_64.c:4183
 > 4183 ur_rip = ULONG(user_regs +
 > (gdb) p user_regs
 > $1 = 0x0
 > 
 > 
 > Workaround, check that bt->machdep is not NULL:
 > 
 > diff -Nupr crash-5.1.8/x86_64.c crash-5.1.8.new/x86_64.c
 > --- crash-5.1.8/x86_64.c 2011-09-16 15:01:12.000000000 -0400
 > +++ crash-5.1.8.new/x86_64.c 2011-09-28 14:12:45.347188571 -0400
 > @@ -4178,7 +4178,7 @@ x86_64_get_dumpfile_stack_frame(struct b
 > goto skip_stage;
 > }
 > }
 > - } else if (ELF_NOTES_VALID()) {
 > + } else if (ELF_NOTES_VALID() && bt->machdep) {
 > user_regs = bt->machdep;
 > ur_rip = ULONG(user_regs +
 > OFFSET(user_regs_struct_rip));
 > 
 > 
 > Second fault, a curiously large n_descsz in elf note header:
 > 
 > please wait... (determining panic task)
 > Program received signal SIGSEGV, Segmentation fault.
 > get_regs_from_note (note=0xd26472 "\b", ip=0x7fffffffc4e0, sp=0x7fffffffc4e8)
 >     at netdump.c:2221
 > 2221 *sp = ULONG(user_regs + offset_sp);
 > (gdb) p *(Elf64_Nhdr *)note
 > $1 = {n_namesz = 8, n_descsz = 3438804992, n_type = 8}
 > 
 > 
 > Workaround, do not attempt reading registers from elf notes (this
 > chunk of code was not present in crash 5.1.1):
 > 
 > diff -Nupr crash-5.1.8/netdump.c crash-5.1.8.new/netdump.c
 > --- crash-5.1.8/netdump.c 2011-09-16 15:01:12.000000000 -0400
 > +++ crash-5.1.8.new/netdump.c 2011-09-28 14:14:43.687183734 -0400
 > @@ -2286,7 +2286,7 @@ get_netdump_regs_x86_64(struct bt_info *
 > 
 > bt->machdep = (void *)user_regs;
 > }
 > -
 > +#if 0
 > if (ELF_NOTES_VALID() &&
 > (bt->flags & BT_DUMPFILE_SEARCH) && DISKDUMP_DUMPFILE() &&
 > (note = (Elf64_Nhdr *)
 > @@ -2305,7 +2305,7 @@ get_netdump_regs_x86_64(struct bt_info *
 > 
 > bt->machdep = (void *)user_regs;
 > }
 > -
 > +#endif
 > machdep->get_stack_frame(bt, ripp, rspp);
 > }
 > 
 > 
 > Given the warning messages at the beginning of the process, I'm not
 > sure whether I'm dealing with a corrupted or incomplete vmcore image.
 > Let me know what additional info could be useful if this seems worth
 > debugging further.
 > 
 > 
 > 
 > Thanks,
 > 
 > -- Joe Lawrence
 > --
 > Crash-utility mailing list
 > Crash-utility(a)redhat.com
 > https://www.redhat.com/mailman/listinfo/crash-utility
 > 

