Re: [Crash-utility] [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

Wednesday, 5 October 2011

On Tue, Oct 04, 2011 at 08:34:40AM +0200, Borislav Petkov wrote:
...
 On Mon, Oct 03, 2011 at 05:33:36PM +0530, K.Prasad wrote:
 > It's interesting...according to Intel's Software Developer Manual
 > (quoting from Volume 3A, Chapter 15), the MCIP bit in IA32_MCG_STATUS
 > MSR behaves as described below.
 > 
 > "MCIP (machine check in progress) flag, bit 2 Indicates (when set)
 > that a machine-check exception was generated. Software can set or clear this
 > flag. The occurrence of a second Machine-Check Event while MCIP is set will
 > cause the processor to enter a shutdown state."
 > 
 > While in do_machine_check function, we enter the panic path (for
 > unrecoverable errors) much before the IA32_MCG_STATUS MSR is reset and
 > this is likely to be dangerous.
 > 
 > 911 void do_machine_check(struct pt_regs *regs, long error_code)
 > 912 {
 > .............
 > ................
 > 1055         if (no_way_out && tolerant < 3)
 > 1056                 mce_panic("Fatal machine check on current CPU", final, msg);
 > .............
 > ................
 > 1073         mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 > 1074 out:
 > 
 > It'd be interesting to know the type of memory error (as classified by
 > the processor) for which you're able to capture the memory dump.
 > Maybe a dump of the various MCE status registers (and struct mce) would
 > help us understand the behaviour on your system better.

 Well, there are MCE types for which we need to panic but we don't
 necessarily corrupt memory. Your approach is to unconditionally avoid
 dumping core whenever we panic while you should look at the MCE
 signature and decide then whether to capture crashed kernel memory or
 not.

 For example, if the MCE signature says UC DRAM error, then you can
 be pretty sure that there is a landmine somewhere in the DRAM region
 mapping the crashed kernel. If it is, say, a UC when doing data fills
 from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
 even in the first case, you can evaluate the MCi_ADDR reported with the
 UC DRAM error and simply skip that particular cacheline when dumping the
 core instead of not capturing anything at all.

True. As I stated earlier, capturing a memory dump in such cases has two
possible outcomes: it is either dangerous or it doesn't make sense. It is
best to avoid a normal kdump in both cases, although the elf-note doesn't
distinguish between the two.
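
To make the signature-based decision from the quoted text concrete, here is a minimal sketch of how a status word could be mapped to a dump policy. It is not taken from the patch. The names `dump_policy`, `mce_dump_policy` and `cacheline_base` are hypothetical, and the error-code test assumes the Intel SDM compound encoding for memory controller errors (0000 0000 1MMM CCCC); other vendors encode this differently.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MCi_STATUS flag bits. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR holds a valid address */

/* Hypothetical policy values, for illustration only. */
enum dump_policy { DUMP_FULL, DUMP_SKIP_LINE, DUMP_NONE };

/* Assumption: Intel compound code for memory controller errors,
 * 0000 0000 1MMM CCCC in the low 16 bits of MCi_STATUS. */
int is_memory_error(uint64_t status)
{
    return (status & 0xFF80) == 0x0080;
}

enum dump_policy mce_dump_policy(uint64_t status)
{
    /* No valid uncorrected error: nothing speaks against a full dump. */
    if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
        return DUMP_FULL;

    /* UC error outside DRAM (e.g. a cache data fill): memory is not
     * necessarily corrupted, so a dump may still be useful. */
    if (!is_memory_error(status))
        return DUMP_FULL;

    /* UC DRAM error: skip the reported line if we know where it is,
     * otherwise do not touch the old kernel's memory at all. */
    return (status & MCI_STATUS_ADDRV) ? DUMP_SKIP_LINE : DUMP_NONE;
}

/* Start address of the cacheline containing addr. */
uint64_t cacheline_base(uint64_t addr, uint64_t line_size)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    return addr & ~(line_size - 1);
}
```

With DUMP_SKIP_LINE, the dump tool would exclude the line starting at cacheline_base(MCi_ADDR, line size) rather than skipping the dump entirely. How that address would be handed over to the capture kernel is left open here.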

NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
a framework where different code paths that lead to panic() can
'opt-out' from kdump by adding an elf-note.

We can modify this to add more fine-grained messages using different elf-note
types (or use the elf-note name under the NT_NOCOREDUMP type) to
indicate the cause/type of crash.
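
As a rough illustration of the second option (carrying the cause in the note name), the sketch below lays out one ELF note in a buffer: a header of three 32-bit words, then the name and the descriptor, each padded to 4 bytes. The helper `build_note`, the constant `NT_NOCOREDUMP_EXAMPLE` and its value, and the name string "PANIC_MCE" are all placeholders of mine, not what the patch uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder value; the real NT_NOCOREDUMP number is defined by the patch. */
#define NT_NOCOREDUMP_EXAMPLE 0x4e434400u

/* Same layout as the ELF note header (Elf32_Nhdr / Elf64_Nhdr). */
struct note_hdr {
    uint32_t n_namesz;  /* length of name, including the trailing NUL */
    uint32_t n_descsz;  /* length of descriptor */
    uint32_t n_type;    /* note type */
};

static size_t align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Write one note into buf. Returns the bytes used, or 0 if it does not fit. */
size_t build_note(unsigned char *buf, size_t buflen, const char *name,
                  uint32_t type, const void *desc, uint32_t descsz)
{
    struct note_hdr hdr;
    size_t namesz = strlen(name) + 1;
    size_t total = sizeof(hdr) + align4(namesz) + align4(descsz);

    if (total > buflen)
        return 0;

    memset(buf, 0, total);              /* zero the padding bytes */
    hdr.n_namesz = (uint32_t)namesz;
    hdr.n_descsz = descsz;
    hdr.n_type = type;
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), name, namesz);
    if (descsz)
        memcpy(buf + sizeof(hdr) + align4(namesz), desc, descsz);
    return total;
}
```

A panic path could then emit, say, the name "PANIC_MCE" with the faulting address as the descriptor, and a tool reading the notes would match on the name to learn why the dump was skipped.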

I'd like to hear further from you and the rest of the community on whether
there is a need for such a change.

...
 Btw, the doublefault example you give above - is this something you
 experience on real hardware or just a theoretical thing?

Unfortunately, I still haven't been able to inject memory errors and
study the behaviour (I'm trying to get access to a machine with
appropriate firmware). I'll reply to this after some experiments
with memory error injection.

Thanks,
K.Prasad
