On Mon, Oct 03, 2011 at 05:33:36PM +0530, K.Prasad wrote:
It's interesting... according to Intel's Software Developer's Manual
(quoting from Volume 3A, Chapter 15), the MCIP bit in the
IA32_MCG_STATUS MSR behaves as described below.
"MCIP (machine check in progress) flag, bit 2 Indicates (when set)
that a machine-check exception was generated. Software can set or clear this
flag. The occurrence of a second Machine-Check Event while MCIP is set will
cause the processor to enter a shutdown state."
In do_machine_check(), however, we enter the panic path (for
unrecoverable errors) well before the IA32_MCG_STATUS MSR is reset, and
this is likely to be dangerous.
911 void do_machine_check(struct pt_regs *regs, long error_code)
912 {
.............
................
1055 if (no_way_out && tolerant < 3)
1056 mce_panic("Fatal machine check on current CPU", final, msg);
.............
................
1073 mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
1074 out:
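To make the ordering concern concrete, here is a minimal userspace sketch
of the control flow in the excerpt above. It is a toy model, not kernel
code: fake_mcg_status, toy_machine_check() and mcip_is_set() are
hypothetical names standing in for the real MSR and handler, and the real
mce_panic() does not return. Only the MCIP bit position (bit 2) comes from
the manual text quoted earlier.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)	/* bit 2 of IA32_MCG_STATUS */

/* Hypothetical stand-in for the MSR; userspace cannot touch the real one. */
static uint64_t fake_mcg_status;

/*
 * Toy model of the excerpt: returns true if the "panic" path was taken.
 * In that case the function leaves before the status register is cleared.
 */
static bool toy_machine_check(bool no_way_out, int tolerant)
{
	/* The processor sets MCIP when it raises the machine check. */
	fake_mcg_status |= MCG_STATUS_MCIP;

	if (no_way_out && tolerant < 3)
		return true;		/* panic path: MCIP is still set */

	/* Mirrors mce_wrmsrl(MSR_IA32_MCG_STATUS, 0) in the excerpt. */
	fake_mcg_status = 0;
	return false;
}

/* True while a second machine-check event would shut the CPU down. */
static bool mcip_is_set(void)
{
	return (fake_mcg_status & MCG_STATUS_MCIP) != 0;
}
```

In this model, anything that runs after the panic path (including a
crash-dump kernel) does so with MCIP still set.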
It'd be interesting to know the type of memory error (as classified by
the processor) for which you're able to capture the memory dump.
Maybe a dump of the various MCE status registers (and struct mce) would
help us understand the behaviour on your system better.
Well, there are MCE types for which we need to panic but which don't
necessarily corrupt memory. Your approach unconditionally avoids
dumping core whenever we panic on an MCE. Instead, you should look at
the MCE signature and only then decide whether to capture the crashed
kernel's memory or not.
For example, if the MCE signature says UC DRAM error, then you can
be pretty sure that there is a landmine somewhere in the DRAM region
mapping the crashed kernel. If it is, say, a UC when doing data fills
from L2 to L1, that doesn't necessarily mean that DRAM is corrupted. But
even in the first case, you can evaluate the MCi_ADDR reported with the
UC DRAM error and simply skip that particular cacheline when dumping the
core instead of not capturing anything at all.
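Something along these lines is what I have in mind, as a rough sketch
only. The MCi_STATUS bit positions (VAL, UC, ADDRV) are the architectural
ones; everything else is an assumption made for illustration: the
mce_origin classification is hypothetical (real decoding of the error code
is vendor- and bank-specific), the 64-byte cacheline size is assumed, and
falling back to "no dump" when no address was logged is just one possible
conservative choice.

```c
#include <stdint.h>

#define MCI_STATUS_VAL		(1ULL << 63)	/* register contents valid */
#define MCI_STATUS_UC		(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_ADDRV	(1ULL << 58)	/* MCi_ADDR is valid */

#define CACHELINE_SIZE		64ULL		/* assumption */

/* Hypothetical result of decoding the MCE signature. */
enum mce_origin { MCE_ORIGIN_DRAM, MCE_ORIGIN_CACHE_FILL, MCE_ORIGIN_OTHER };

enum dump_policy { DUMP_ALL, DUMP_SKIP_LINE, DUMP_NONE };

struct dump_decision {
	enum dump_policy policy;
	uint64_t skip_start;	/* start of the cacheline to skip, if any */
};

static struct dump_decision decide_dump(uint64_t status, uint64_t addr,
					enum mce_origin origin)
{
	struct dump_decision d = { DUMP_ALL, 0 };

	/* Nothing valid or nothing uncorrected was logged: dump everything. */
	if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
		return d;

	/* A UC on, e.g., an L2->L1 fill does not imply corrupted DRAM. */
	if (origin != MCE_ORIGIN_DRAM)
		return d;

	/* UC DRAM error without an address: we cannot avoid the bad line. */
	if (!(status & MCI_STATUS_ADDRV)) {
		d.policy = DUMP_NONE;
		return d;
	}

	/* UC DRAM error with an address: skip only that cacheline. */
	d.policy = DUMP_SKIP_LINE;
	d.skip_start = addr & ~(CACHELINE_SIZE - 1);
	return d;
}
```

The dump code would then copy everything except the range
[skip_start, skip_start + CACHELINE_SIZE) in the DUMP_SKIP_LINE case.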
Btw, the doublefault example you give above - is this something you
experience on real hardware or just a theoretical thing?
Thanks.
--
Regards/Gruss,
Boris.