On Thu, 2008-02-07 at 11:27 -0500, Dave Anderson wrote:
Andrew Hecox wrote:
> On Thu, 2008-02-07 at 10:32 -0500, Dave Anderson wrote:
>> Andrew Hecox wrote:
>>> hello,
>>>
>>> I'm looking at a customer issue where diskdumpmsg is unable to read a
>>> vmcore file. It is not clear if this is a problem with the vmcore file or
>>> diskdumpmsg. I can load the vmcore with crash and in my naive usage of
>>> it, can see no problems. However, I'm new to the tool so that doesn't
>>> give me a lot of confidence.
>>>
>>> Does anyone have any suggestions on how or if I can use crash to help
>>> determine if there's corruption in the vmcore file? Or any other way of
>>> approaching the problem?
>>>
>>> Thanks much,
>>>
>>> Andrew
>>>
>> I'm not sure what you expect the crash utility to do -- if it comes
>> up to a prompt with no error or warning messages, it means that the
>> ELF header contains what appears to be valid usable information,
>> and that the minimum kernel memory contents required to set up the
>> crash utility's notion of the running system are all in place. That's
>> not to say that there is no chance that the vmcore contains some
>> corruption that was not recognized.
>>
>
> Thanks. Any other suggestions on how to determine if a vmcore is "valid"
> or is that not even a reasonable question to try and ask? The problem
> I'm trying to solve is described better below:
>
>> With respect to diskdumpmsg, as I understand it, it was fairly recently
>> changed from a perl script to a C file so that it could be run
>> earlier in time so as to be able to use the swap partition. Looking
>> at main() in the diskdumpmsg.c file (version 1.4.1-2), there are numerous
>> error types and associated error messages. What do you mean when you
>> say that "diskdumpmsg is unable to read a vmcore file"?
>
> Specifically:
>
> - user reported a floating point exception from diskdump on startup
> - the result was reproducible locally but only with their vmcore file
> - fpe occurred in get_logbuf:
>       log_end %= log_buf_len;
> - log_buf_len had been set to 0 in read_buffer
>       if (!page_is_dumpable(pfn, dump->device)) {
>               memset(buf, 0, copy_len);
>       } else {
> - I don't know enough to say if the page really wasn't dumpable.
>       static inline bool page_is_dumpable(unsigned int nr, DumpDevice *device)
>       {
>               return device->dumpable_bitmap[nr>>3] & (1 << (nr & 7));
>       }
> - I wrote a patch with one way to avoid the FPE (attached) and sent it
> to SEG.
>
> Now I'm trying to determine if the vmcore file should be readable by
> diskdumpmsg. In other words, is this a problem in diskdumpmsg post-crash
> or a problem with the vmcore file prior to it getting to diskdumpmsg.
> Unfortunately, I don't understand the problem domain very well at all,
> hence the probably naive questions :)
>
> Any suggestions are appreciated.
>
> -Andrew
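
For anyone following along, the path described in the list above can be sketched in isolation. This is only an illustration, not the diskdumputils source: the DumpDevice struct here is a hypothetical stand-in that models nothing but the bitmap, and read_page() is an invented, simplified substitute for read_buffer(). It shows how a cleared bit in dumpable_bitmap makes the reader hand back zeros, which is how a value such as log_buf_len can come back as 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real DumpDevice; only the bitmap is modelled. */
typedef struct {
    const unsigned char *dumpable_bitmap;   /* one bit per page frame */
} DumpDevice;

/* Same test as the quoted page_is_dumpable(): byte nr/8, bit nr%8. */
static bool page_is_dumpable(unsigned int nr, const DumpDevice *device)
{
    return (device->dumpable_bitmap[nr >> 3] & (1u << (nr & 7))) != 0;
}

/*
 * Invented, simplified reader (not the real read_buffer()): a page whose
 * bit is clear is returned as zeros instead of the stored data.
 */
static void read_page(const DumpDevice *device, unsigned int pfn,
                      const void *page_data, void *buf, size_t copy_len)
{
    if (!page_is_dumpable(pfn, device))
        memset(buf, 0, copy_len);
    else
        memcpy(buf, page_data, copy_len);
}
```

With a bitmap of { 0x01, 0x00 }, pfn 0 is marked dumpable and pfn 9 is not, so reading an unsigned long through read_page() at pfn 9 yields 0 regardless of what was stored. On x86 Linux an integer modulo by that 0 typically raises SIGFPE, which the shell reports as a floating point exception.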

So it appears that the page containing the log_buf_len symbol is not
readable or contained in the dumpfile. BTW, is this a compressed
dumpfile or an ELF formatted dumpfile? And what "dump_level" did
they configure?

compressed, level is 19.

Anyway, back to the log_buf_len symbol read, what happens when you
enter the "log" command while in a crash session? It attempts to
read that symbol immediately.

I get what appears to be a full and valid dump of the kernel message
buffer.

-Andrew

Dave
>>
>> ------------------------------------------------------------------------
>>
>> diff -rupN diskdumputils-1.4.1.orig/diskdumpmsg.c diskdumputils-1.4.1/diskdumpmsg.c
>> --- diskdumputils-1.4.1.orig/diskdumpmsg.c 2008-02-06 14:32:41.000000000 -0500
>> +++ diskdumputils-1.4.1/diskdumpmsg.c 2008-02-06 15:56:22.000000000 -0500
>> @@ -208,6 +208,10 @@ static int get_logbuf(DumpFile *dump, ch
>>
>> len = log_end;
>> } else {
>> + if (!log_buf_len) {
>> + ret = READ_ERROR_IN_DUMP_FILE;
>> + goto err;
>> + }
>> log_end %= log_buf_len;
>>
>> ret = read_buffer(dump, log_buf + log_end,
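
The guard in the patch above can also be looked at on its own. The following is a minimal sketch under stated assumptions, not code from diskdumpmsg.c: wrap_log_end() is a hypothetical helper name, and it returns a bool instead of setting ret and jumping to err as the patch does. It only demonstrates the idea of refusing to take the modulo when the length read from the dump is zero.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of diskdumputils): reduce *log_end into
 * the range [0, log_buf_len). Returns false and leaves *log_end untouched
 * when log_buf_len is zero, so the caller can report a read error
 * rather than dividing by zero.
 */
static bool wrap_log_end(unsigned long *log_end, unsigned long log_buf_len)
{
    if (log_buf_len == 0)
        return false;

    *log_end %= log_buf_len;
    return true;
}
```

A caller would treat a false return the same way the patch treats a zero log_buf_len, that is, as an error reading the dump file. Whether a zero length means the dump is damaged or the page was legitimately excluded is a separate question that this check does not answer.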