Re: [Crash-utility] Crash faults when determining panic task

Friday, 30 September 2011

----- Original Message -----
...

 RE: [Crash-utility] Crash faults when determining panic task

 > It would be interesting to find out what happened in the
 > x86_process_elf_notes() function. 
Thanks for your help debugging this -- the dumpfile contains pretty
much what I expected: 

 (1) a single NT_PRSTATUS note (n_type 1, n_descsz 336)
 (2) followed by the VMCOREINFO note (n_type 0, n_descsz 1392), and 
 (?) zero-filled dumpfile data (n_type 0, n_descsz 0) 

By comparison, if I add this debug printf() to x86_process_elf_notes()
and run it against an 8-way compressed kdump:

  --- diskdump.c  30 Sep 2011 15:09:56 -0000      1.39
  +++ diskdump.c  30 Sep 2011 18:58:28 -0000
  @@ -243,7 +243,7 @@
          for (tot = 0; tot < size_note; tot += len) {
                  if (machine_type("X86_64")) {
                          note64 = note_ptr + tot;
  -
  +  fprintf(fp, "n_type: %d n_descsz: %d\n", note64->n_type, note64->n_descsz);
                          if (note64->n_type == NT_PRSTATUS) {
                                  dd->nt_prstatus_percpu[num] = note64;
                                  num++;

I see this:

  ...
  This program has absolutely no warranty.  Enter "help warranty" for details.

  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 1 n_descsz: 336
  n_type: 0 n_descsz: 1373
  GNU gdb (GDB) 7.0   
  Copyright (C) 2009 Free Software Foundation, Inc.
  ...

So there's no extra zero-filled dumpfile location that gets checked,
i.e., it cleanly works its way through the dumpfile's notes region.
I don't know why that's not true with your dumpfile.
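One possible reading, offered as a sketch rather than a confirmed diagnosis: take the sizes from the gdb session quoted further down (size_note=1780, name sizes 5 and 11) and assume the usual ELF note layout of a 12-byte header followed by the name and descriptor, each padded to 4 bytes. The two real notes then occupy 356 + 1416 = 1772 bytes, which leaves 8 bytes of the region, less than one header. A loop bounded only by `tot < size_note` would therefore make a third pass over zero-filled data. The names `struct nhdr`, `put_note()`, `count_prstatus()`, `walk_example()` and the `guard` parameter below are hypothetical and exist only for illustration; this is not the code in diskdump.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_PRSTATUS 1
#define SIZE_NOTE   1780u                 /* size_note from the gdb session */
#define ROUNDUP4(x) (((x) + 3u) & ~3u)    /* assumed 4-byte note padding */

/* Same layout as Elf64_Nhdr: three 32-bit words. */
struct nhdr {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
};

/* Bytes occupied by one note: header, padded name, padded descriptor. */
static size_t note_len(const struct nhdr *n)
{
    return sizeof(*n) + ROUNDUP4(n->n_namesz) + ROUNDUP4(n->n_descsz);
}

/* Write a note header at offset off (payload stays zero); return next offset. */
static size_t put_note(unsigned char *buf, size_t off,
                       uint32_t namesz, uint32_t descsz, uint32_t type)
{
    struct nhdr n = { namesz, descsz, type };
    memcpy(buf + off, &n, sizeof(n));
    return off + note_len(&n);
}

/*
 * Walk the notes region and count NT_PRSTATUS notes.  *visited receives
 * the number of headers examined.  With guard set, the walk stops when
 * less than a full header remains or when a header has an empty name.
 */
static int count_prstatus(const unsigned char *buf, size_t size,
                          int guard, int *visited)
{
    size_t tot = 0;
    int num = 0;

    *visited = 0;
    while (tot < size) {
        struct nhdr n;

        if (guard && size - tot < sizeof(n))
            break;
        memcpy(&n, buf + tot, sizeof(n));
        if (guard && n.n_namesz == 0)
            break;
        (*visited)++;
        if (n.n_type == NT_PRSTATUS)
            num++;
        tot += note_len(&n);
    }
    return num;
}

/* Rebuild the two-note region from the gdb session and walk it. */
static int walk_example(int guard, int *visited)
{
    /* Extra header-sized slack so the unguarded walk stays in bounds. */
    unsigned char buf[SIZE_NOTE + sizeof(struct nhdr)] = {0};
    size_t off = put_note(buf, 0, 5, 336, NT_PRSTATUS);

    off = put_note(buf, off, 11, 1392, 0);
    assert(off == 1772);    /* 356 + 1416 */
    return count_prstatus(buf, SIZE_NOTE, guard, visited);
}
```

Under these assumptions the unguarded walk examines three headers (the third being all zeros), while the guarded walk stops after two. Either way only one NT_PRSTATUS note is found, so the guard explains the extra iteration but not the missing notes.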

And, as it turns out, the per-cpu readmem() complaint is perfectly
legitimate -- I see the same thing on the compressed kdump example
above.  It's just that the loop has gone beyond the end of the per-cpu
data -- in your case, it's trying to read non-existent per-cpu 
data for the non-existent cpu 16.  So that's not a problem...
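The address arithmetic can be checked against the gdb output quoted further down. The sketch below copies the 17 `__per_cpu_offset` entries from that session. It assumes that the per-cpu symbol value (`sp->value` in the readmem() call) is 0xdb00, a figure obtained only by subtracting entry 16 from the failing address and not stated anywhere in the thread. `per_cpu_addr()` and `slot_for_addr()` are hypothetical helper names.

```c
#include <stdint.h>

#define NR_SLOTS 17

/* kt->__per_cpu_offset[0]@17 as printed by gdb. */
static const uint64_t per_cpu_offset[NR_SLOTS] = {
    0xffff880028200000ULL, 0xffff880028240000ULL, 0xffff880028280000ULL,
    0xffff8800282c0000ULL, 0xffff880287400000ULL, 0xffff880287440000ULL,
    0xffff880287480000ULL, 0xffff8802874c0000ULL, 0xffff880028300000ULL,
    0xffff880028340000ULL, 0xffff880028380000ULL, 0xffff8800283c0000ULL,
    0xffff880287500000ULL, 0xffff880287540000ULL, 0xffff880287580000ULL,
    0xffff8802875c0000ULL, 0xffffffff81ba6000ULL
};

/* Assumed symbol value: failing address minus per_cpu_offset[16]. */
#define CPU_NUMBER_SYM 0xdb00ULL

/* Address the loop would read for a given slot. */
static uint64_t per_cpu_addr(int slot)
{
    return CPU_NUMBER_SYM + per_cpu_offset[slot];
}

/* Return the slot whose computed address equals addr, or -1 if none. */
static int slot_for_addr(uint64_t addr)
{
    for (int i = 0; i < NR_SLOTS; i++)
        if (per_cpu_addr(i) == addr)
            return i;
    return -1;
}
```

With that assumed symbol value, slot 16 yields 0xffffffff81bb3b00, which is the address in the "page excluded" message (18446744071591115520 in decimal, as gdb shows it for the readmem() call). Slots 0 through 15 all resolve to addresses in the 0xffff8800... and 0xffff8802... ranges instead.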

I still don't understand why the dumpfile doesn't have the other
15 NT_PRSTATUS notes, but until that patch was added into crash-5.1.5,
we never cared, and it would never have been noticed.  When I accepted
that patch, I was apprehensive that something like this might happen,
which is why I insisted that they also add the "--no_elf_notes"
option as a pre-emptive workaround:

...

https://www.redhat.com/archives/crash-utility/2011-April/msg00030.html

 Finally, in the interest of paranoia, give the user the capability
 of *not* using this facility.  In main.c, create a "--no_elf_notes"
 option (similar to "--zero_excluded"), and have it set a NO_ELF_NOTES
 bit in the globally-accessible "*diskdump_flags".  
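As a rough illustration of the arrangement described in that request, the sketch below sets a flag bit from a command-line option and tests it before any note processing. The bit value, the storage variable, and the function names `handle_option()` and `want_elf_notes()` are all hypothetical; the real definitions live in the crash sources and may differ.

```c
#include <string.h>

/* Hypothetical bit value; the real NO_ELF_NOTES value is not shown here. */
#define NO_ELF_NOTES_BIT 0x1UL

/* Stand-in for the globally-accessible flags word. */
static unsigned long flags_storage;
static unsigned long *example_diskdump_flags = &flags_storage;

/* Set the opt-out bit when the option is seen; ignore anything else. */
static void handle_option(const char *arg)
{
    if (strcmp(arg, "--no_elf_notes") == 0)
        *example_diskdump_flags |= NO_ELF_NOTES_BIT;
}

/* Nonzero when the ELF notes should be used. */
static int want_elf_notes(void)
{
    return (*example_diskdump_flags & NO_ELF_NOTES_BIT) == 0;
}
```

The point of the design is that the opt-out is a single bit test, so code paths that depend on the notes can be skipped without touching the rest of the dumpfile handling.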
So anyway, that all being the case, and with the two patches applied,
we've pretty much solved your problem from the crash utility's
perspective.  Perhaps there's a kernel kdump or makedumpfile issue,
but that's beyond the scope of this mailing list.

Thanks again, 
  Dave

...

 *** Breakpoints in x86_process_elf_notes()...

 (gdb) break diskdump.c:245
 Breakpoint 1 at 0x52379b: file diskdump.c, line 245.
 (gdb) r

 Breakpoint 1, x86_process_elf_notes (note_ptr=0xd1e000,
 size_note=1780)
 at diskdump.c:245
 245 note64 = note_ptr + tot;
 (gdb) p *(Elf64_Nhdr *)(note_ptr + tot)
 $1 = {n_namesz = 5, n_descsz = 336, n_type = 1}
 (gdb) c
 Continuing.

 Breakpoint 1, x86_process_elf_notes (note_ptr=0xd1e000,
 size_note=1780)
 at diskdump.c:245
 245 note64 = note_ptr + tot;
 (gdb) p *(Elf64_Nhdr *)(note_ptr + tot)
 $2 = {n_namesz = 11, n_descsz = 1392, n_type = 0}
 (gdb) c
 Continuing.

 Breakpoint 1, x86_process_elf_notes (note_ptr=0xd1e000,
 size_note=1780)
 at diskdump.c:245
 245 note64 = note_ptr + tot;
 (gdb) p *(Elf64_Nhdr *)(note_ptr + tot)
 $3 = {n_namesz = 0, n_descsz = 0, n_type = 0}
 (gdb) c
 Continuing.

 >> crash: page excluded: kernel virtual address: ffffffff81bb3b00  type: "cpu number (per_cpu)"
 >> crash: page excluded: kernel virtual address: ffffffff81bb3b00  type: "cpu number (per_cpu)"
 > [snip]
 > loop in both functions -- can you dump out which cpu's
 > per-cpu data was inaccessible?

 (gdb) break memory.c:1976
 Breakpoint 1 at 0x4722ff: file memory.c, line 1976.
 (gdb) set arg -d1 vmlinux vmcore
 (gdb) r
 Breakpoint 1, readmem (addr=18446744071591115520, memtype=1,
 buffer=0x7fffffff5b5c, size=4, type=0x7c7744 "cpu number (per_cpu)",
 error_handle=6) at memory.c:1976
 1976 error(INFO, PAGE_EXCLUDED_ERRMSG, memtype_string(memtype, 0),
 addr, type);
 (gdb) up
 #1 0x00000000004e5871 in x86_64_get_smp_cpus () at x86_64.c:4674
 4674 if (!readmem(sp->value + kt->__per_cpu_offset[i],
 (gdb) p cpunumber
 $1 = 15
 (gdb) p cpus
 $2 = 16
 (gdb) p i
 $3 = 16
 (gdb) p/x kt->__per_cpu_offset[0]@17
 $4 = {0xffff880028200000, 0xffff880028240000, 0xffff880028280000,
 0xffff8800282c0000, 0xffff880287400000, 0xffff880287440000,
 0xffff880287480000, 0xffff8802874c0000, 0xffff880028300000,
 0xffff880028340000, 0xffff880028380000, 0xffff8800283c0000,
 0xffff880287500000, 0xffff880287540000, 0xffff880287580000,
 0xffff8802875c0000, 0xffffffff81ba6000}

 > Joe, do you know if the non-crashing cpus were in some kind of
 > bizarre state such that they would not respond to the shutdown NMI?
 > I suppose in that case, there would be only the one NT_PRSTATUS
 > note for the crashing cpu (plus the VMCOREINFO note).

 The other CPUs are almost all sitting idle, a few are running I/O.

 > In any case, so far I've got two patches queued to help address
 > the two segmentation violations generated by a scenario such as
 > this.

 Patches applied and verified no segmentation faults.

 I have uploaded this vmcore/vmlinux to our FTP site (details to come
 in private mail).

 Thanks,

 -- Joe Lawrence
 --
 Crash-utility mailing list
 Crash-utility(a)redhat.com
 https://www.redhat.com/mailman/listinfo/crash-utility
