Re: [Crash-utility] [Patch 1/4][kernel][slimdump] Add new elf-note of type NT_NOCOREDUMP to capture slimdump

Wednesday, 5 October 2011

On Wed, Oct 05, 2011 at 12:37:28PM +0530, K.Prasad wrote:
...
 > Well, there are MCE types for which we need to panic but we don't
 > necessarily corrupt memory. Your approach is to unconditionally avoid
 > dumping core whenever we panic while you should look at the MCE
 > signature and decide then whether to capture crashed kernel memory or
 > not.
 > 
 > For example, if the MCE signature says UC DRAM error, then you can
 > be pretty sure that there is a landmine somewhere in the DRAM region
 > mapping the crashed kernel. If it is, say, a UC when doing data fills
 > from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
 > even in the first case, you can evaluate the MCi_ADDR reported with the
 > UC DRAM error and simply skip that particular cacheline when dumping the
 > core instead of not capturing anything at all.
 > 

 True. As I stated earlier, capturing a memory dump in such cases has two
 possible outcomes: it is either dangerous or it doesn't make sense.

Why, in the second example the only corruption is to the L2 cache so
your memory image is intact. Why wouldn't you want to capture a memory
dump then? It is business as usual in that case.

...
 It is best to avoid a normal kdump in both cases,
 although the elf-note doesn't distinguish between the two.

 NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
 a framework where different code paths that lead to panic() can
 'opt-out' from kdump by adding an elf-note.

 We can modify this to add more fine-grained messages using different elf-note
 types (or use the elf-note name under the NT_NOCOREDUMP type) to
 indicate the cause/type of crash.

 I'd like to hear further from you and the rest of the community to see if
 there's a need for such a change.

I'd make this conditional on whether you have had memory corruption or
not by evaluating MCE signatures and acting accordingly.
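
To make the suggestion concrete, here is a minimal sketch of such a check, not code from the patch set. It assumes the MCi_STATUS layout and the memory-controller compound error code (0000 0000 1MMM CCCC) documented in the Intel SDM; other vendors encode this differently. The names mce_sig, dump_policy and choose_dump_policy, and the 64-byte cacheline size, are hypothetical and used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCi_STATUS bits, per the Intel SDM layout (assumed here). */
#define MCI_STATUS_VAL   (1ULL << 63)  /* bank holds a valid error  */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was not corrected   */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR content is valid */

#define CACHELINE_SIZE 64ULL           /* assumption: 64-byte lines */

/* Hypothetical policy the capture path could act on. */
enum dump_policy { DUMP_FULL, DUMP_SKIP_LINE, DUMP_NONE };

/* Hypothetical container for one bank's signature. */
struct mce_sig {
    uint64_t status;  /* MCi_STATUS */
    uint64_t addr;    /* MCi_ADDR   */
};

/* True for a valid, uncorrected memory-controller (DRAM) error. */
static bool is_uc_dram_error(uint64_t status)
{
    uint16_t code = (uint16_t)(status & 0xffff);  /* MCA error code */

    return (status & MCI_STATUS_VAL) && (status & MCI_STATUS_UC) &&
           (code & 0xff80) == 0x0080;
}

/*
 * No DRAM corruption indicated: dump as usual.
 * UC DRAM error with a valid address: skip only that cacheline.
 * UC DRAM error without an address: the bad line is unknown, so opt out.
 */
static enum dump_policy choose_dump_policy(const struct mce_sig *m,
                                           uint64_t *skip_line)
{
    if (!is_uc_dram_error(m->status))
        return DUMP_FULL;
    if (!(m->status & MCI_STATUS_ADDRV))
        return DUMP_NONE;

    *skip_line = m->addr & ~(CACHELINE_SIZE - 1);
    return DUMP_SKIP_LINE;
}
```

With this, a UC error on a data fill from L2 to L1 yields DUMP_FULL, while a UC DRAM error with a valid MCi_ADDR yields DUMP_SKIP_LINE plus the aligned address of the line to leave out. A real implementation would also have to check how many low bits of MCi_ADDR are actually valid for the reporting bank.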

...
 > Btw, the doublefault example you give above - is this something you
 > experience on real hardware or just a theoretical thing?
 >

 Unfortunately, I still haven't been able to try injecting memory errors
 and study the behaviour (trying to get access to machine with
 appropriate firmware). I'll have a reply to this after some experiments
 with memory error injection.

Right, this might be much more helpful than theoretical discussions on
what to do. :-)

Thanks.

-- 
Regards/Gruss,
    Boris.
