Re: [Crash-utility] Handle the NT_PRSTATUS lost for the "bt" command

Monday, 18 June 2012

----- Original Message -----
...
 The purpose of this patch is to make the "bt" command work for a
 diskdump in which some NT_PRSTATUS notes could not be saved because
 an IPI was lost.  I think an IPI can be lost during a panic when the
 system is in a badly crashed state.

 I noticed that "bt" failed in my ppc environment
 when the NT_PRSTATUS notes were lost on some CPUs during IPI delivery.
 So I made the CPU-to-prstatus map for diskdump more accurate
 by checking the validity of the crash_notes field.

 I tested this by patching the kernel as follows (kernel/kexec.c):
 void crash_save_cpu(struct pt_regs *regs, int cpu)
 {
 +        if (current->pid == 0)
 +                /* this cpu was idle; nothing to capture */
 +                return;

 It looks like a terrible and impractical test case, but I actually found
 this code in the kernel of the distro I am using.  I couldn't reproduce an
 actual lost-IPI case, so I used this as an example of what happens when
 an IPI cannot be delivered to the other CPUs.

 => Took a diskdump with sysrq+c and makedumpfile.

 crash> help -D | grep notes
   num_prstatus_notes: 1
            notes_buf: 10ba91a8
             notes[0]: 10ba91a8
 crash> help -k | grep cpus
           cpus: 8
  cpus_override: (null)
 crash> bt
 PID: 1001   TASK: ea62b000  CPU: 2   COMMAND: "bash"
 Segmentation fault

 Since the seven idle CPUs did not save an NT_PRSTATUS note,
 crash could not handle CPU#2's note, which was stored as if it were CPU#0's.

 With this patch, crash works with a correct CPU-to-prstatus map:

 WARNING: catch lost crash_notes at cpu#0
 WARNING: catch lost crash_notes at cpu#1
 WARNING: catch lost crash_notes at cpu#3
 WARNING: catch lost crash_notes at cpu#4
 WARNING: catch lost crash_notes at cpu#5
 WARNING: catch lost crash_notes at cpu#6
 WARNING: catch lost crash_notes at cpu#7
 crash.fix> help -D | grep notes
   num_prstatus_notes: 1
            notes_buf: 107a3378
             notes[2]: 107a3378
 crash.fix> help -k | grep cpus
           cpus: 8
  cpus_override: (null)
 crash.fix> bt
 PID: 1001   TASK: ea62b000  CPU: 2   COMMAND: "bash"

 R0:  00000001   R1:  eb793e60   R2:  ea62b000   R3:  00000063
 R4:  00000000   R5:  ffffffff   R6:  c043ba2c   R7:  00000000
 R8:  00008000   R9:  00000000   R10: 00000000   R11: eb793e70
 R12: 28242444   R13: 100b8448   R14: 100b07b8   R15: 100b0894
 R16: 00000000   R17: 00000000   R18: 00000000   R19: 1006d270
 R20: 00000000   R21: 100f0430   R22: 00000000   R23: 00000001
 R24: c08f1ac8   R25: 00029002   R26: c08f1bac   R27: c08d0000
 R28: 00000000   R29: c09ada48   R30: 00000063   R31: eb793e60
 NIP: c0423378   MSR: 00021002   OR3: c09ada48   CTR: c0423344
 LR:  c0423d8c   XER: 00000000   CCR: 28242444   MQ:  00008000
 DAR: 00000000 DSISR: 00800000        Syscall Result: eb793e60
  NIP [00000000c0423378] sysrq_handle_crash
  LR  [00000000c0423d8c] __handle_sysrq

  #0 [eb793e60] sysrq_handle_crash at c0423378
   : snip

 Thanks,
 Toshi

Toshi,

I don't want to add any new initialization-time code -- especially if
it's related to the NT_PRSTATUS notes -- that can abort a crash session
unnecessarily.  In your new crash_was_lost_crash_note() function, there
are these two FAULT_ON_ERROR readmem() calls:

	readmem(symbol_value("crash_notes"), KVADDR, &crash_notes_ptr,
		sizeof(ulong), "crash_notes", FAULT_ON_ERROR);
and

	readmem(crash_notes_ptr, KVADDR, buf, SIZE(note_buf),
		"cpu crash_notes", FAULT_ON_ERROR);

Although they are highly unlikely to fail, can you please make 
both of them RETURN_ON_ERROR, and if the readmem() fails, have 
it bail out and return FALSE?  And then, if necessary, make any
adjustments to map_cpus_to_prstatus_kdump_cmprs() to handle that 
remote possibility.  You should be able to test it with your 
patched kernel.  
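
For illustration only, here is a minimal standalone sketch of the
bail-out pattern requested above.  It is not the patch itself: the
read_fn callback is a hypothetical stand-in for readmem() called with
RETURN_ON_ERROR, the two toy readers exist only to exercise it, and the
all-zero test for an unsaved note is an assumption (per-cpu offset
handling is also left out).

```c
#include <stddef.h>
#include <string.h>

#define TRUE  1
#define FALSE 0

/* Hypothetical stand-in for readmem(..., RETURN_ON_ERROR):
 * returns TRUE on success, FALSE on failure, never aborts. */
typedef int (*read_fn)(unsigned long addr, void *buf, size_t len);

/*
 * Returns TRUE only if both reads succeed and the note buffer is
 * all zeroes (assumed here to mean "not saved").  Any read failure
 * returns FALSE so the caller can keep its default cpu mapping.
 */
static int note_not_saved(read_fn rd, unsigned long crash_notes_sym,
                          unsigned char *buf, size_t note_size)
{
	unsigned long notes_ptr;
	size_t i;

	if (!rd(crash_notes_sym, &notes_ptr, sizeof(notes_ptr)))
		return FALSE;           /* bail out, do not abort the session */
	if (!rd(notes_ptr, buf, note_size))
		return FALSE;
	for (i = 0; i < note_size; i++)
		if (buf[i] != 0)
			return FALSE;   /* something was saved for this cpu */
	return TRUE;
}

/* Toy reader that always fails, to model an unreadable address. */
static int read_always_fails(unsigned long addr, void *buf, size_t len)
{
	(void)addr; (void)buf; (void)len;
	return FALSE;
}

/* Toy reader that succeeds and returns zero-filled memory. */
static int read_zeroes(unsigned long addr, void *buf, size_t len)
{
	(void)addr;
	memset(buf, 0, len);
	return TRUE;
}
```

With read_always_fails the function returns FALSE without touching the
session; with read_zeroes it reports the note as not saved.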

Also, I don't understand the wording of this error message
at the end of crash_was_lost_crash_note():

        error(WARNING, "catch lost crash_notes at cpu#%d\n", cpu);  

Can you re-word that?  The notes were not "lost", but rather were
"not saved" by the crashing system.

Lastly, in __diskdump_memory_dump(), you just skip the "lost"
notes sections:

        for (i = 0, j = 0; j < dd->num_prstatus_notes; i++) {
                if (dd->nt_prstatus_percpu[i] == NULL)
                        continue;
                fprintf(fp, "            notes[%d]: %lx\n",
                        i, (ulong)dd->nt_prstatus_percpu[i]);
                j++;
        }

Can you make it more obvious, say, by displaying something like:

      notes[6]: (not saved)
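
As a rough sketch of that display (an illustration under assumptions,
not the final code), a helper could format one line per cpu and print
"(not saved)" for a NULL entry.  The helper name format_note_line() and
its buffer-based interface are hypothetical; the real code would
fprintf() directly inside __diskdump_memory_dump().

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format the dump line for one cpu's NT_PRSTATUS note into 'out'.
 * A NULL note pointer is shown as "(not saved)" instead of being
 * skipped, so gaps in the per-cpu array are visible.
 */
static void format_note_line(char *out, size_t outlen, int cpu,
                             const void *note)
{
	if (note == NULL)
		snprintf(out, outlen, "            notes[%d]: (not saved)\n",
			 cpu);
	else
		snprintf(out, outlen, "            notes[%d]: %lx\n",
			 cpu, (unsigned long)note);
}
```

Note that the existing loop stops once it has seen num_prstatus_notes
valid entries, so any unsaved cpus after the last saved note would only
be shown if the loop ran over all cpus instead.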

Thanks,
  Dave
