On Wed, Oct 05, 2011 at 03:17:27PM +0530, K.Prasad wrote:
We don't want to capture a memory dump when the machine crashes due to a
faulty cache, because the end-user derives no benefit from receiving a
bulky vmcore and running crash analysis tools over it. Instead, a
'slimdump' that contains a meaningful message about the origin of the crash
(and which can be understood by his analysis tools) would be better, or
so I thought.
Ok, this makes sense: a meaningful message along with the MCE decoded
properly in user-friendly language so that one can understand why the
system has not captured a vmcore.
There are possibly several hardware errors which cause system crash and
the kdump would capture full vmcore, although it doesn't make sense (I
wouldn't have cared about the second example, you cited, if they did not
generate MCE, but a different exception). In an ideal situation, each of
these error paths would 'subscribe' to slimdump and add a meaningful
message in the NT_NOCOREDUMP note instead of letting the user-space copy
the old kernel memory.
Yep, I see.
Fine with me. I see that the various IA32_MCi_STATUS registers will hold
information about the error, and we can use that to classify MCEs.
I think the best way to go about it is to retain NT_NOCOREDUMP for non-DRAM
errors also, but use the note-name field in the elf-note and distinguish the
various types of errors...say, by using names such as "PANIC_MCE_DRAM",
"PANIC_MCE_CACHE", etc (similar to the error codes described in the Intel
manual). The upstream tools like 'makedumpfile' and 'crash' will have to
be taught to parse the elf-note name and act accordingly.
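
To make the note-name idea concrete, here is a rough sketch of how a
status value could be mapped to such a name. It assumes the compound
error code layout of IA32_MCi_STATUS[15:0] from the Intel SDM (memory
controller errors as 0000 0000 1MMM CCCC, cache hierarchy errors as
0000 0001 RRRR TTLL, bit 12 being the correction-report filtering bit).
The function name mce_note_name() and the "PANIC_MCE_OTHER" fallback are
made up for illustration; nothing like this exists in the patch set.

```c
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL     (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_MCACOD  0xffffULL     /* MCA error code, bits 15:0 */
#define MCACOD_FILTER      0x1000ULL     /* bit 12, not part of the class */

/*
 * Hypothetical helper: pick an ELF note name from an IA32_MCi_STATUS
 * value. Returns NULL if the bank holds no valid error.
 */
const char *mce_note_name(uint64_t status)
{
	uint16_t code;

	if (!(status & MCI_STATUS_VAL))
		return NULL;

	code = (uint16_t)(status & MCI_STATUS_MCACOD & ~MCACOD_FILTER);

	/* Memory controller errors: 0000 0000 1MMM CCCC */
	if ((code & 0xff80) == 0x0080)
		return "PANIC_MCE_DRAM";

	/* Cache hierarchy errors: 0000 0001 RRRR TTLL */
	if ((code & 0xff00) == 0x0100)
		return "PANIC_MCE_CACHE";

	/* Generic cache hierarchy errors: 0000 0000 0000 11LL */
	if ((code & 0xfffc) == 0x000c)
		return "PANIC_MCE_CACHE";

	/* Assumed catch-all name for everything else (TLB, bus, ...) */
	return "PANIC_MCE_OTHER";
}
```

The crash path would pass the returned string as the note name when it
fills in NT_NOCOREDUMP, and 'makedumpfile' or 'crash' would compare
against the same strings.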
Right, so Valdis had the right question in the other mail, let me
generalize it here: does it ever make sense to save vmcore on a hardware
error?
With DRAM errors, you probably could use the additional info coming with
the MCE to decode the physical address, map it back to the DIMM and
swap it. Any other use cases?
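
For the DRAM case, a minimal sketch of pulling a usable physical address
out of the MCE registers might look like the following. It assumes the
ADDRV/MISCV bits of IA32_MCi_STATUS and the address-mode and LSB fields
of IA32_MCi_MISC as documented in the Intel SDM for CPUs with software
error recovery support. mce_phys_addr() is a hypothetical name, and
accepting the address unmasked when MISCV is clear is an assumption of
this sketch. Mapping the address to a DIMM is left to the EDAC layer and
not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_VAL    (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* IA32_MCi_MISC is valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* IA32_MCi_ADDR is valid */

/* IA32_MCi_MISC: bits 5:0 = least significant valid address bit,
 * bits 8:6 = address mode (2 means physical address). */
#define MCI_MISC_ADDR_LSB(m)   ((unsigned int)((m) & 0x3f))
#define MCI_MISC_ADDR_MODE(m)  ((unsigned int)(((m) >> 6) & 0x7))
#define MCI_MISC_ADDR_PHYS     2

/*
 * Hypothetical helper: store the faulting physical address in *phys and
 * return true if the bank reports one, false otherwise.
 */
bool mce_phys_addr(uint64_t status, uint64_t addr, uint64_t misc,
		   uint64_t *phys)
{
	if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_ADDRV))
		return false;

	if (status & MCI_STATUS_MISCV) {
		unsigned int lsb = MCI_MISC_ADDR_LSB(misc);

		if (MCI_MISC_ADDR_MODE(misc) != MCI_MISC_ADDR_PHYS)
			return false;

		/* Bits below the reported LSB carry no information. */
		addr &= ~((1ULL << lsb) - 1);
	}

	*phys = addr;
	return true;
}
```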
Thanks.
--
Regards/Gruss,
Boris.