Re: [Crash-utility] Question for LKCD maintainers - How about adding a debug flag to crash and only calling abort() if crash is started with '-d' flag provided?

Wednesday, 2 January 2008

Piet Delaney wrote:
...
 Dave Anderson wrote:
 
>Long after I stopped tinkering with the LKCD code in crash,
>changes were contributed to support physical memory zones
>in the LKCD dumpfile format.
 
 
 Hi Dave:
 
 That could easily have been me. I added zone support to the
 LKCD kernel and lcrash code and then updated your crash code
 to support zones. I kinda recall LKCD not dumping in monotonically
 increasing order and my modifying your crash code to live with this
 new feature in the LKCD dumps. I was trying to get the LKCD folks into
 supporting crash in addition to lcrash but failed to get any support from
 Tom Morano or Matt Robinson. I didn't realize that I had broken crash
 with the zone changes and felt responsible to fix crash to deal with this
 change that I had made. I also prefer the crash interface to the lcrash
 interface. I proposed to Tom using the ELF format like KEXEC uses, but
 he didn't go for it. I don't know why we can't hide additional crash info
 in ELF files and maintain as much compatibility as possible.
 
 
 
>  Specifically there is this
>piece of save_offset() in lkcd_common.c:
>
>        /* find the zone */
>        for (ii=0; ii < lkcd->num_zones; ii++) {
>                if (lkcd->zones[ii].start == zone) {
>                        if (lkcd->zones[ii].pages[page].offset != 0) {
>                           if (lkcd->zones[ii].pages[page].offset != off) {
>                                error(INFO, "conflicting page: zone %lld, "
>                                        "page %lld: %lld, %lld != %lld\n",
>                                        (unsigned long long)zone,
>                                        (unsigned long long)page,
>                                        (unsigned long long)paddr,
>                                        (unsigned long long)off,
>                                        (unsigned long long)lkcd->zones[ii].pages[page].offset);
>                                abort();
>                           }
>                           ret = 0;
>                        } else {
>                           lkcd->zones[ii].pages[page].offset = off;
>                           ret = 1;
>                        }
>                        break;
>                }
>        }
 
 The printf looks a bit like my coding style, though I don't know
 why (I ?)  decided to abort() in this case. I suppose the idea is
 to look at the situation with gdb on the resulting core file.
 
 
 
>The call to abort() above kills the crash session, which is both
>annoying and unnecessary.
 
 Isn't it worthwhile to look at the core file to understand why
 abort() was called?

I would think so, but not by me -- the developers of this LKCD
off-shoot can debug their own stuff.

...
 
 
>I am seeing it in a dumpfile from a customer who has their own dumping
>scheme that is based upon LKCD version 7.  I understand that this may be a
>problem with their LKCD port, but nonetheless, it's the only place in
>the crash utility that doesn't recover gracefully from dumpfile access
>errors.
>
>Anyway, I would like to either:
>
> 1. change the error(INFO...) to error(FATAL...) so that run-time
>    commands encountering this error will just fail, and the session
>    will return to the crash> prompt, or
> 2. return 0, so that a "seek error" can be subsequently displayed
>    by the readmem() command.
>
>Number 2 is preferable, because it yields more clues as to where the
>readmem() came from, but since I don't know much about the LKCD
>physical memory zones stuff, is there any reason that shouldn't
>be done?
 
 
 How about having a crash debug flag and only calling abort() if the
 debug flag is set? You might print in the error message that the
 user can force a core dump by adding a '-d' flag on invocation of
 crash and sending you the core file.

Regardless of the reason behind it, the whole point is that there
was no need to abort the crash session.  If the "missing" page was
crucial to the crash session being able to run, then crash would
die on its own terms.  There are no other abort() calls in the
crash sources.

In this case, however, the page was unnecessary for analysis of
the problem.  Yet when some commands (I forget which -- certainly
"search" for example) bumped into the page, the session would
abort() and had to be started up again.

Anyway, the abort() call was removed in version 4.0-4.9:

   - Fix for LKCD dumpfile access failures that abort() the crash session
     after displaying an error message indicating a problem with physical
     memory zones in the dumpfile.  Without the patch, the crash session
     would end immediately after displaying an error message of the sort:
     "conflicting page: zone 0, page 0: 0, 177160130 != 65536".  That
     error message will now only be displayed if the crash debug mode is 1
     or more, a readmem() "seek error" will be displayed instead, and the
     session will return to the "crash>" prompt.  (anderson(a)redhat.com)

This was the patch:

--- lkcd_common.c       15 Nov 2007 15:44:38 -0000      1.29
+++ lkcd_common.c       19 Nov 2007 15:48:18 -0000      1.30
@@ -708,14 +708,15 @@
                 if (lkcd->zones[ii].start == zone) {
                         if (lkcd->zones[ii].pages[page].offset != 0) {
                            if (lkcd->zones[ii].pages[page].offset != off) {
-                               error(INFO, "conflicting page: zone %lld, "
+                               if (CRASHDEBUG(1))
+                                   error(INFO, "LKCD: conflicting page: zone %lld, "
                                         "page %lld: %lld, %lld != %lld\n",
                                         (unsigned long long)zone,
                                         (unsigned long long)page,
                                         (unsigned long long)paddr,
                                         (unsigned long long)off,
                                         (unsigned long long)lkcd->zones[ii].pages[page].offset);
-                               abort();
+                               return -1;
                            }
                            ret = 0;
                         } else {

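To make the effect of the patch easier to see outside the crash sources, here is a minimal, self-contained sketch of the same idea: a conflicting page offset is reported only when a debug level is set, and the caller gets an error return instead of the process being aborted. Everything prefixed with demo_ is hypothetical and simplified; it is not the real lkcd_common.c code, and the -2 return for an unknown zone or out-of-range page is an assumption made for this sketch only.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the LKCD zone and page tables. */
#define DEMO_PAGES_PER_ZONE 4

struct demo_page { uint64_t offset; };   /* 0 means "not recorded yet" */
struct demo_zone {
    uint64_t start;
    struct demo_page pages[DEMO_PAGES_PER_ZONE];
};

/* Stands in for the crash debug level tested by CRASHDEBUG(1). */
static int demo_debug_level = 0;

/*
 * Record the dumpfile offset of a page.
 * Returns  1 if the offset was newly recorded,
 *          0 if the same offset was already recorded,
 *         -1 if a different offset was already recorded (conflict),
 *         -2 if the zone or page is unknown (assumption for this sketch).
 */
static int demo_save_offset(struct demo_zone *zones, int num_zones,
                            uint64_t zone, uint64_t page, uint64_t off)
{
    for (int ii = 0; ii < num_zones; ii++) {
        if (zones[ii].start != zone)
            continue;
        if (page >= DEMO_PAGES_PER_ZONE)
            return -2;

        uint64_t *slot = &zones[ii].pages[page].offset;
        if (*slot == 0) {
            *slot = off;
            return 1;
        }
        if (*slot != off) {
            /* Only chatter about the conflict in debug mode. */
            if (demo_debug_level >= 1)
                fprintf(stderr,
                        "conflicting page: zone %llu, page %llu: %llu != %llu\n",
                        (unsigned long long)zone,
                        (unsigned long long)page,
                        (unsigned long long)off,
                        (unsigned long long)*slot);
            return -1;   /* let the caller fail the read; do not abort() */
        }
        return 0;
    }
    return -2;
}
```

With a single zone starting at 0, recording page 0 at offset 65536 returns 1, recording it again returns 0, and trying to record it at 177160130 returns -1 while leaving the stored offset untouched.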
With respect to the -d flag suggestion, if you want to drop core
then you can set the internal crash "core" variable to "on", which
will force a segmentation violation after printing the next
error message:

   crash> set core
   core: off (do NOT drop core on error message)
   crash> set core on
   core: on (drop core on error message)
   crash>

And then run the command that generates the error, say for
example, reading a non-existent physical address:

   crash> rd -p deadbeef
   [./crash] error trace: 8095503 => 8095799 => 8096ab4 => 808879c
   rd: read error: physical address: deadbeef  type: "32-bit PHYSADDR"

     808879c: __error+108
     8096ab4: readmem+1328
     8095799: display_memory+657
     8095503: cmd_rd+1558

   DROP_CORE flag set: forcing a segmentation fault
   Segmentation fault (core dumped)
   $
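
The general pattern behind a "drop core on error" switch can be sketched in a few lines of C. This is only an illustration under stated assumptions: the demo_ names are hypothetical, and how crash itself forces the fault is not shown in this thread, so the sketch simply uses raise(SIGSEGV). The helper runs the error path in a child process, with core files disabled, so that the effect can be observed without killing the caller.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical flag, playing the role of "set core on". */
static int demo_drop_core = 0;

/* Print an error; if the flag is set, die with SIGSEGV instead of returning. */
static int demo_error(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
    if (demo_drop_core) {
        fprintf(stderr, "DROP_CORE flag set: forcing a segmentation fault\n");
        signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
        raise(SIGSEGV);
    }
    return 0;                       /* flag off: caller carries on */
}

/*
 * Trigger demo_error() in a child process with the flag set as requested.
 * Returns the signal number that killed the child, 0 if it exited
 * normally, or -1 if fork() or waitpid() failed.
 */
static int demo_signal_from_child(int drop_core)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);  /* avoid leaving core files behind */
        demo_drop_core = drop_core;
        demo_error("read error");
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling demo_signal_from_child(0) reports a normal exit, while demo_signal_from_child(1) reports that the child was killed by SIGSEGV. Keeping the switch off by default is what lets ordinary errors return to the prompt.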

...
 
 
 While I've got your attention. I'm upgrading our 2.6.12-stable kernel to
 2.6.16-stable and want to start supporting core dumps. Ideally I'd like to
 have core dumps that are compatible with gdb and crash. Can crash
 handle the ELF core files generated by KEXEC/KCORE? Last I thought
 about this I recall there being incompatibilities and it getting worse
 with kernels being compiled to be relocatable and kgdb having a problem
 because it wasn't aware of the relocation.

By "KEXEC/KCORE" I'm presuming you mean "kexec/kdump", but I'm
not sure what incompatibility you're referring to.

Maybe the workaround for x86 kernels whose CONFIG_PHYSICAL_START
contains a value that is greater than CONFIG_PHYSICAL_ALIGN:

   http://people.redhat.com/anderson/crash.changelog.html#4_0_4_5

Or maybe you're talking about 32-bit gdb not being able to handle
kdump-generated 64-bit ELF core files for 32-bit kernels?

Dave



    
