On Wed, Oct 05, 2011 at 12:37:28PM +0530, K.Prasad wrote:
> > Well, there are MCE types for which we need to panic but we don't
> > necessarily corrupt memory. Your approach is to unconditionally avoid
> > dumping core whenever we panic while you should look at the MCE
> > signature and decide then whether to capture crashed kernel memory or
> > not.
> >
> > For example, if the MCE signature says UC DRAM error, then you can
> > be pretty sure that there is a landmine somewhere in the DRAM region
> > mapping the crashed kernel. If it is, say, a UC when doing data fills
> > from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
> > even in the first case, you can evaluate the MCi_ADDR reported with the
> > UC DRAM error and simply skip that particular cacheline when dumping the
> > core instead of not capturing anything at all.
>
> True. As I stated earlier, there could be two possible outcomes
> from capturing a memory dump in such cases - it is either dangerous or
> it doesn't make sense.

Why, in the second example the only corruption is to the L2 cache so
your memory image is intact. Why wouldn't you want to capture a memory
dump then? It is business as usual in that case.

> It is best to avoid a normal kdump in both cases,
> although the elf-note doesn't distinguish between the two.
> NT_NOCOREDUMP, in my opinion, is just the first step towards introducing
> a framework where different code paths that lead to panic() can
> 'opt-out' from kdump by adding an elf-note.
> We can modify this to add more fine-grained messages using different elf-note
> types (or use the elf-note name under the NT_NOCOREDUMP type) to
> indicate the cause/type of crash.
> I'd like to hear further from you and the rest of the community to see if
> there's a need felt for such a change.

I'd make this conditional on whether you have had memory corruption or
not by evaluating MCE signatures and acting accordingly.

> > Btw, the doublefault example you give above - is this something you
> > experience on real hardware or just a theoretical thing?
>
> Unfortunately, I still haven't been able to try injecting memory errors
> and study the behaviour (trying to get access to a machine with
> appropriate firmware). I'll have a reply to this after some experiments
> with memory error injection.

Right, this might be much more helpful than theoretical discussions on
what to do. :-)
Thanks.
--
Regards/Gruss,
Boris.