On Wed, Oct 05, 2011 at 08:58:53AM -0700, Luck, Tony wrote:
> > The plan is to pass down the list of poisoned memory pages to the second
> > kernel using an elf-note so that these pages are left untouched during
> > dump capture. I'm working on an implementation of the same and should
> > have patches soon.
>
> I would say let us first figure out what happens while reading a poisoned
> page and is this a problem before working on a solution.

If the page is poisoned because of a real uncorrectable error in memory
(reported as an SRAO machine check today, or by SRAR real-soon-now), then
accessing the page from the processor while taking a memory dump will
result in a machine check.

Note that a large-memory system that has been running for a long time
may have built up a small stash of these land-mine pages, and we need
to worry about them even when the panic is not machine-check related.
In fact, especially then: that is the case where we actually want the
dump to diagnose the cause of the panic, and we don't want to risk
losing the crash dump because we aborted when touching a page that the
OS had safely avoided for days/weeks/months.

So passing a list of poisoned pages from the old kernel to the new kernel
is a good idea - and is independent of the cause of the crash (except that
in the fatal machine check case due to memory error the list is guaranteed
to be non-empty).

Where is this poisoned page info stored? In struct page? If yes, then
user space can walk through it and make sure not to touch poisoned pages.
Anyway, the user space filtering utility "makedumpfile" already walks
through struct pages to decide which pages to filter out. It should be
able to filter out poisoned pages unconditionally. So there should be no
need for the kernel to export a list of these pages.

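To make the idea concrete, here is a rough sketch of the kind of check a
user space filter could do while walking page descriptors. It is not
makedumpfile code: the struct, the function names and the bit position
are all made up for illustration, and the real PG_hwpoison bit depends
on kernel version and configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: illustrative bit position only. The real PG_hwpoison bit
 * number varies with kernel version and configuration. */
#define EXAMPLE_PG_HWPOISON_BIT 19u

/* Hypothetical, simplified view of one page descriptor as a dump
 * filter might see it: a page frame number plus its flags word. */
struct example_page {
    uint64_t pfn;
    uint64_t flags;
};

/* Return 1 if the (assumed) hwpoison bit is set in the flags word. */
static int example_page_is_poisoned(const struct example_page *p)
{
    return (int)((p->flags >> EXAMPLE_PG_HWPOISON_BIT) & 1u);
}

/* Store the PFNs of pages that are safe to read into out[] and return
 * how many were kept. Only the descriptors are inspected, so the
 * contents of a poisoned page are never touched. */
static size_t example_select_dumpable(const struct example_page *pages,
                                      size_t n, uint64_t *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (example_page_is_poisoned(&pages[i]))
            continue;   /* skip the land-mine page */
        out[kept++] = pages[i].pfn;
    }
    return kept;
}
```

The point is only that the decision is made from the flags word, before
any read of the page contents is attempted.
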
Thanks
Vivek