Re: [Crash-utility] determining a "valid" vmcore

Thursday, 7 February 2008

On Thu, 2008-02-07 at 14:40 -0500, Dave Anderson wrote:
...
 Andrew Hecox wrote:
 > On Thu, 2008-02-07 at 11:27 -0500, Dave Anderson wrote:
 >> Andrew Hecox wrote:
 >>> On Thu, 2008-02-07 at 10:32 -0500, Dave Anderson wrote:
 >>>> Andrew Hecox wrote:
 >>>>> hello,
 >>>>>
 >>>>> I'm looking at a customer issue where diskdumpmsg is unable to read
 >>>>> a vmcore file. It is not clear if this is a problem with the vmcore
 >>>>> file or diskdumpmsg. I can load the vmcore with crash and in my
 >>>>> naive usage of it, can see no problems. However, I'm new to the
 >>>>> tool so that doesn't give me a lot of confidence.
 >>>>>
 >>>>> Does anyone have any suggestions on how or if I can use crash to
 >>>>> help determine if there's corruption in the vmcore file? Or any
 >>>>> other way of approaching the problem?
 >>>>>
 >>>>> Thanks much,
 >>>>>
 >>>>> Andrew
 >>>>>
 >>>> I'm not sure what you expect the crash utility to do -- if it comes
 >>>> up to a prompt with no error or warning messages, it means that the
 >>>> ELF header contains what appears to be valid usable information,
 >>>> and that the minimum kernel memory contents required to set up the
 >>>> crash utility's notion of the running system are all in place.
 >>>> That's not to say that there is no chance that the vmcore contains
 >>>> some corruption that was not recognized.
 >>>>
 >>> Thanks. Any other suggestions on how to determine if a vmcore is
 >>> "valid" or is that not even a reasonable question to try and ask? The
 >>> problem I'm trying to solve is described better below:
 >>>
 >>>> With respect to diskdumpmsg, as I understand it, it was fairly recently
 >>>> changed from a perl script to a C file so that it could be run
 >>>> earlier in time so as to be able to use the swap partition.  Looking
 >>>> at main() in the diskdumpmsg.c file (version 1.4.1-2), there are
 >>>> numerous error types and associated error messages.  What do you
 >>>> mean when you say that "diskdumpmsg is unable to read a vmcore file"?
 >>> Specifically: 
 >>>
 >>>  - user reported a floating point exception from diskdump on startup
 >>>  - the result was reproducible locally but only with their vmcore file
 >>>  - fpe occurred in get_logbuf:
 >>>                 log_end %= log_buf_len;
 >>>  - log_buf_len had been set to 0 in read_buffer
 >>>           if (!page_is_dumpable(pfn, dump->device)) {
 >>>               memset(buf, 0, copy_len);
 >>>           } else {
 >>>  - I don't know enough to say if the page really wasn't dumpable. 
 >>> static inline bool page_is_dumpable(unsigned int nr, DumpDevice *device)
 >>> {
 >>>   return device->dumpable_bitmap[nr>>3] & (1 << (nr & 7));
 >>> }
 >>>  - I wrote a patch with one way to avoid the FPE (attached) and sent it
 >>> to SEG.
 >>>
 >>> Now I'm trying to determine if the vmcore file should be readable by
 >>> diskdumpmsg. In other words, is this a problem in diskdumpmsg post-crash
 >>> or a problem with the vmcore file prior to it getting to diskdumpmsg.
 >>> Unfortunately, I don't understand the problem domain very well at all,
 >>> hence the probably naive questions :)
 >>>
 >>> Any suggestions are appreciated.
 >>>
 >>> -Andrew
 >> So it appears that the page containing the log_buf_len symbol is not
 >> readable or contained in the dumpfile.  BTW, is this a compressed
 >> dumpfile or an ELF formatted dumpfile?  And what "dump_level" did
 >> they configure?
 >>
 > 
 > compressed, level is 19.
 > 
 >> Anyway, back to the log_buf_len symbol read, what happens when you
 >> enter the "log" command while in a crash session?  It attempts to
 >> read that symbol immediately.
 >>
 > 
 > I get what appears to be a full and valid dump of the kernel message
 > buffer. 
 > 
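
For reference, the failure quoted above reduces to a modulo by zero:
log_buf_len is read back as 0 (the buffer was zero-filled because the
page was reported as not dumpable), and "log_end %= log_buf_len" then
divides by zero, which on x86 typically raises SIGFPE. The sketch below
shows one way such a guard could look. It is not the attached patch;
the helper name wrap_log_index and the operand types are assumptions
made for illustration only.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from diskdumpmsg.c.
 * Reduces log_end to an index inside a ring buffer of log_buf_len bytes.
 * Returns 0 on success and -1 when log_buf_len is not a usable length,
 * so the caller can report an unreadable log instead of dividing by zero.
 */
static int wrap_log_index(unsigned long log_end, int log_buf_len,
                          unsigned long *index)
{
    if (index == NULL || log_buf_len <= 0)
        return -1;              /* zero-filled or corrupt value */

    *index = log_end % (unsigned long)log_buf_len;
    return 0;
}
```

A caller would check the return value and print an error about the
unreadable log_buf_len rather than continue into the modulo.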

 The crash utility has the same page_is_dumpable() function, which I presume
 looks at precisely the same bitmap data from the dumpfile.  And that
 must be working, given that the "log" command works as expected.
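
The bitmap test in page_is_dumpable() is a plain one-bit-per-page
lookup: nr >> 3 selects the byte and 1 << (nr & 7) selects the bit
within it. A minimal standalone sketch of the same test follows. The
DumpDevice structure is not shown in this thread, so the sketch takes
the bitmap pointer directly, and the name bitmap_bit_is_set is
hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical standalone version of the bit test used by
 * page_is_dumpable(): one bit per page frame, least significant
 * bit first within each byte. The caller must ensure that the
 * bitmap holds at least (nr >> 3) + 1 bytes.
 */
static bool bitmap_bit_is_set(const unsigned char *bitmap, unsigned int nr)
{
    return (bitmap[nr >> 3] & (1u << (nr & 7))) != 0;
}
```

With a bitmap whose second byte is 0x04, for example, page 10 (byte 1,
bit 2) tests as set and its neighbours test as clear.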

 One difference is that diskdumpmsg uses /boot/System.map-<release> for
 the symbol values, whereas crash uses the vmlinux file.  It might be
 of interest to determine whether the value of "log_buf_len" used by
 diskdumpmsg is the same symbol value as used by crash.

I get the same: 

(/boot/System.map-2.6.9-67.0.1.ELhugemem)

02323bd8 d log_buf_len

(/usr/lib/debug/lib/modules/2.6.9-67.0.1.ELhugemem/vmlinux)

$1 = (int *) 0x2323bd8
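
The comparison above can also be done mechanically. The following is a
small sketch, assuming the usual "<hex address> <type> <name>" layout
of a System.map line; the function name map_line_address is
hypothetical and is not part of diskdumpmsg or crash.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one System.map-style line of the form
 * "<hex address> <type> <name>". Returns 1 and stores the address in
 * *addr when the line names `symbol`, otherwise returns 0.
 */
static int map_line_address(const char *line, const char *symbol,
                            unsigned long *addr)
{
    unsigned long value;
    char type;
    char name[128];

    if (sscanf(line, "%lx %c %127s", &value, &type, name) != 3)
        return 0;               /* not a well-formed map line */
    if (strcmp(name, symbol) != 0)
        return 0;               /* some other symbol */

    *addr = value;
    return 1;
}
```

Feeding it the line "02323bd8 d log_buf_len" yields 0x2323bd8, which
can then be compared against the address printed from the vmlinux
debug information.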

-Andrew

...
 Dave
