----- "Pavan Naregundi" <pavan(a)linux.vnet.ibm.com> wrote:
> On Tue, 2010-04-20 at 09:14 -0400, Dave Anderson wrote:
> > ----- "Pavan Naregundi" <pavan(a)linux.vnet.ibm.com> wrote:
> >
> > The cause for seek errors depends upon the type
> > of dumpfile.
> >
> > You didn't mention which type of dumpfile the vmcore
> > is, so I'll presume that it's either an ELF-format
> > kdump or a compressed kdump created by makedumpfile.
> >
> > So presuming that it's a compressed kdump, the seek error
> > most likely comes from here in read_diskdump() in diskdump.c:
> >
> >     if ((pfn >= dd->header->max_mapnr) || !page_is_ram(pfn))
> >         return SEEK_ERROR;
> >
> > where the requested physical address pfn values are larger
> > than the max_mapnr value advertised in the header.
> >
> > When you do any "crash -d# ...", the dumpfile header will
> > be dumped first. What does that show?
> >
> > Dave
>
> Dave,
>
> Dumpfile is compressed kdump created by makedumpfile.
> header shows the following values:
>   max_mapnr: 32768
>   block_shift: 16
>
> Yes. Adding some debug printf's shows me that (pfn >=
> dd->header->max_mapnr) fails.
>
> For example: in the first seek error,
> crash: seek error: kernel virtual address: c0000000af715480 type:
> "kmem_cache buffer"
> paddr: af715480 => pfn=44913
>
> crash -d8 log:
> http://pastebin.com/qrCvyPfR
>
> Thanks..Pavan

OK, so the compressed dumpfile has exactly 32768 pages of physical
memory, or exactly 2GB. That being the case, the crash utility
will fail all readmem attempts above that value, and obviously
there is critical data above the artificial 2GB threshold.
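
To make the arithmetic concrete, here is a minimal C sketch. It assumes
that the header's block_shift is the page shift (so pages are 64KB, which
is consistent with 32768 pages being exactly 2GB). The helper names are
hypothetical and are not crash utility functions; only the first half of
the read_diskdump() test is modelled, and page_is_ram() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: physical address -> page frame number. */
static uint64_t paddr_to_pfn(uint64_t paddr, unsigned page_shift)
{
    assert(page_shift < 64);
    return paddr >> page_shift;
}

/*
 * Hypothetical helper: models only the "pfn >= max_mapnr" half of the
 * read_diskdump() check. Returns 1 if the pfn is past the end of the dump.
 */
static int pfn_beyond_dump(uint64_t pfn, uint64_t max_mapnr)
{
    return pfn >= max_mapnr;
}

/* Hypothetical helper: bytes of physical memory covered by max_mapnr pages. */
static uint64_t dump_span_bytes(uint64_t max_mapnr, unsigned page_shift)
{
    assert(page_shift < 64);
    return max_mapnr << page_shift;
}
```

With the values from this thread, paddr_to_pfn(0xaf715480, 16) gives
44913, which is not below the max_mapnr of 32768, so the read is rejected.
dump_span_bytes(32768, 16) is 2147483648 bytes, i.e. 2GB.
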
The question at hand is why kdump is creating a truncated dumpfile
with a max_mapnr of 32768:

 (1) makedumpfile determines the "max_mapnr" value based upon the
     highest physical address found in any of the PT_LOAD segments
     of the /proc/vmcore file on the secondary kernel.
 (2) the /proc/vmcore PT_LOAD segments were pre-calculated during
     the primary kernel's kdump initialization phase, based upon
     the values found in the set of "/proc/device-tree/memory@xxx/reg"
     files existing in the primary kernel, where the "xxx" is the
     starting physical address of the memory region, and the "reg"
     file in that directory contains the size of the memory region.
     For whatever reason, those files showed a maximum of 2GB of
     physical memory. (If you do not use makedumpfile, and then do
     a "readelf -a" of the resultant vmcore file, you will see
     the PT_LOAD segment values.)
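
Step (1) can be sketched roughly as follows. This is an illustration and
not makedumpfile's actual code: struct load_segment is a hypothetical,
cut-down stand-in for an ELF program header, max_mapnr_from_segments() is
a made-up name, and rounding the highest address up to a page boundary is
an assumption on my part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PT_LOAD program header (only two fields). */
struct load_segment {
    uint64_t p_paddr;   /* starting physical address of the segment */
    uint64_t p_memsz;   /* size of the segment in memory, in bytes */
};

/*
 * Derive a max_mapnr-style page count from the highest physical address
 * covered by any segment. Overflow of p_paddr + p_memsz is not handled.
 */
static uint64_t max_mapnr_from_segments(const struct load_segment *seg,
                                        size_t count, unsigned page_shift)
{
    uint64_t highest_end = 0;
    uint64_t page_size;
    size_t i;

    assert(page_shift < 64);
    page_size = 1ULL << page_shift;

    for (i = 0; i < count; i++) {
        uint64_t end = seg[i].p_paddr + seg[i].p_memsz;
        if (end > highest_end)
            highest_end = end;
    }

    /* Assumed: round up so a partially covered last page still counts. */
    return (highest_end + page_size - 1) >> page_shift;
}
```

If the only segment handed to this function starts at 0 and is 0x80000000
bytes long, the result with a page shift of 16 is 32768, matching the
header in this dump. Memory that never made it into a PT_LOAD segment
cannot raise that number.
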
Does the SLES11 vmlinux-2.6.32.10-0.4.99.25.62005-ppc64 kernel
contain this patch?:
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.gi...
author Brian King <brking(a)linux.vnet.ibm.com>
Mon, 19 Oct 2009 05:51:34 +0000 (05:51 +0000)
committer Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Fri, 30 Oct 2009 06:20:56 +0000 (17:20 +1100)
commit 8be8cf5b47f72096e42bf88cc3afff7a942a346c
tree 9adff0fa02123f48fbfa40abb55a5c01be8c2fa4
parent 6cff46f4bc6cc4a8a4154b0b6a2e669db08e8fd2
powerpc: Add kdump support to Collaborative Memory Manager
When running Active Memory Sharing, the Collaborative Memory Manager (CMM)
may mark some pages as "loaned" with the hypervisor. Periodically, the
CMM will query the hypervisor for a loan request, which is a single signed
value. When kexec'ing into a kdump kernel, the CMM driver in the kdump
kernel is not aware of the pages the previous kernel had marked as "loaned",
so the hypervisor and the CMM driver are out of sync. Fix the CMM driver
to handle this scenario by ignoring requests to decrease the number of loaned
pages if we don't think we have any pages loaned. Pages that are marked as
"loaned" which are not in the balloon will automatically get switched to
"active" the next time we touch the page. This also fixes the case where
totalram_pages is smaller than min_mem_mb, which can occur during kdump.
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
I ask because we also have an outstanding bugzilla that exhibits similar
behavior, where an abnormally small ppc64 vmcore file gets created
because there was only a single /proc/device-tree/memory@0 directory
file that showed just a small subset of the total physical memory.
Typically there are many of those "memory@xxx" directories, but in
the failing scenario, there was only one /proc/device-tree/memory@0
directory.
Anyway, there's (unproven) speculation that the kernel patch above
is related to the problem.
In any case, unfortunately, there's nothing that can be done from the
crash utility's perspective.
Dave