Re: [Crash-utility] Handle the NT_PRSTATUS lost for the "bt" command

Tuesday, 19 June 2012

----- Original Message -----
...
 Hi Dave,

 Thanks for your comments.

 (2012/06/19 1:01), Dave Anderson wrote:

 > I don't want to add any new initialization-time code -- especially if
 > it's related to the NT_PRSTATUS notes -- that can abort a crash session
 > unnecessarily.  In your new crash_was_lost_crash_note() function, there
 > are these two FAULT_ON_ERROR readmem() calls:
 > 
 > 	readmem(symbol_value("crash_notes"), KVADDR,&crash_notes_ptr,
 > 		sizeof(ulong), "crash_notes", FAULT_ON_ERROR);
 > and
 > 
 > 	readmem(crash_notes_ptr, KVADDR, buf, SIZE(note_buf),
 > 		"cpu crash_notes", FAULT_ON_ERROR);
 > 
 > Although they are highly unlikely to fail, can you please make
 > both of them RETURN_ON_ERROR, and if the readmem() fails, have
 > it bail out and return FALSE?

 I see, I'll make consideration about initialization-time rule from now,
 to keep crash session, should use readmem() with QUIET|RETURN_ON_ERROR.

 > And then, if necessary, make any adjustments to map_cpus_to_prstatus_kdump_cmprs()
 > to handle that remote possibility.  You should be able to test it with your
 > patched kernel.

 I'm not able to test for unexpected, deliberate readmem() failure
 because my core file contains "crash_notes" memory field...

 Although I've probably misunderstood your advice,
 I had better add more debugging messages into
 map_cpus_to_prstatus_kdump_cmprs() to handle conditions
 where imply "not saved NT_PRSTATUS" or "nt_prstatus_percpu[] is
 remapped".

 > Also, I don't understand the wording of this error message
 > at the end of crash_was_lost_crash_note():
 > 
 >          error(WARNING, "catch lost crash_notes at cpu#%d\n", cpu);
 > 
 > Can you re-word that?  The notes were not "lost", but rather were
 > "not saved" by the crashing system.

 I re-word all "lost" keywords, so too function name with "saved".
 And also make this warning valid only when CRASHDEBUG() is enable
 because we can check it later by using "help -D".

 > Lastly, in __diskdump_memory_dump(), you just skip the "lost"
 > notes sections:
 > 
 >          for (i = 0, j = 0; j<  dd->num_prstatus_notes; i++) {
 >                  if (dd->nt_prstatus_percpu[i] == NULL)
 >                          continue;
 >                  fprintf(fp, "            notes[%d]: %lx\n",
 >                          i, (ulong)dd->nt_prstatus_percpu[i]);
 >                  j++;
 >          }
 > 
 > Can you make it more obvious, say, by displaying something like:
 > 
 >        notes[6]: (not saved)

 Looks better, thanks for giving good hint.

 [updates by attached patch]

 - display messages only if CRASHDEBUG() >= 1
 crash -d 1
 => display about NT_PRSTATUS mapping messages as below.
 ----------------
 WARNING: cpu#0: not saved crash_notes
 WARNING: cpu#1: not saved crash_notes
 crash: NT_PRSTATUS: cpu#2 = 107a34a8 (remapped from cpu#0)
 WARNING: cpu#3: not saved crash_notes
 WARNING: cpu#4: not saved crash_notes
 WARNING: cpu#5: not saved crash_notes
 WARNING: cpu#6: not saved crash_notes
 WARNING: cpu#7: not saved crash_notes
 ----------------

 - help message is enhanced
 crash> help -D | grep notes
   num_prstatus_notes: 1
            notes_buf: 107a34a8
             notes[0]: (not saved)
             notes[1]: (not saved)
             notes[2]: 107a34a8
             notes[3]: (not saved)
             notes[4]: (not saved)
             notes[5]: (not saved)
             notes[6]: (not saved)
             notes[7]: (not saved)

 Thanks,
 Toshi 
OK, now I'm getting confused...

I note that currently __diskdump_memory_dump() does this:

        for (i = 0; i < dd->num_prstatus_notes; i++) {
                fprintf(fp, "            notes[%d]: %lx\n",
                        i, (ulong)dd->nt_prstatus_percpu[i]);
        }

But your patch does this:

        nrcpus = kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS;

        for (i = 0; i < nrcpus; i++) {
                if (dd->nt_prstatus_percpu[i] == NULL) {
                        fprintf(fp,
                        "            notes[%d]: (not saved)\n",
                                i);
                        continue;
                }
                fprintf(fp, "            notes[%d]: %lx\n",
                        i, (ulong)dd->nt_prstatus_percpu[i]);
        }

Now surely we don't want to dump "NR_CPUS" notes pointers, 
unless there actually are that many cpus in the system.
For example, in RHEL6 sets CONFIG_NR_CPUS to 4096 for x86_64,
but rarely do we see a dumpfile with that many cpus.
So I would prefer that the for loop continue to be bounded
by "dd->num_prstatus_notes".

However, given that process_elf32_notes() and process_elf64_notes()
do a cursory test of each note by checking for the NT_PRSTATUS
type, I'm presuming that in your dumpfile, dd->num_prstatus_notes
must be equal to 1, correct?

  void
  process_elf64_notes(void *note_buf, unsigned long size_note)
  {
          Elf64_Nhdr *nt;
          size_t index, len = 0;
          int num = 0;

          for (index = 0; index < size_note; index += len) {
                  nt = note_buf + index;

                  if(nt->n_type == NT_PRSTATUS) {
                          dd->nt_prstatus_percpu[num] = nt;
                          num++;
                  }
                  len = sizeof(Elf64_Nhdr);
                  len = roundup(len + nt->n_namesz, 4);
                  len = roundup(len + nt->n_descsz, 4);
          }

          if (num > 0) {
                  pc->flags2 |= ELF_NOTES;
                  dd->num_prstatus_notes = num;
          }
          return;
  }

But then map_cpus_to_prstatus_kdump_cmprs() remaps the
notes without modifying the dd->num_prstatus_notes counter.

So then there's this:

  void *
  diskdump_get_prstatus_percpu(int cpu)
  {
          if ((cpu < 0) || (cpu >= dd->num_prstatus_notes))
                  return NULL;

          return dd->nt_prstatus_percpu[cpu];
  }

So in a case such as your example, where cpu 2 was the only cpu that saved 
its notes, the function above would incorrectly return NULL when called 
with diskdump_get_prstatus_percpu(2).

What am I missing?

Dave

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [Crash-utility] Handle the NT_PRSTATUS lost for the "bt" command