On Tuesday 10 January 2012 20:24:58, Dave Anderson wrote:
----- Original Message -----
> Hi folks,
>
> I've just discovered that the crash utility fails to initialize the vm
> subsystem properly on our latest SLES 32-bit kernels. It turns out that
> our kernels are compiled with CONFIG_DISCONTIGMEM=y, which causes pgdat
> structs to be allocated by the remap allocator (cf.
> arch/x86/mm/numa_32.c and also the code in setup_node_data).
>
> If you don't know what the remap allocator is (like I didn't before I hit
> the bug), it's a very special early-boot allocator which remaps physical
> pages from low memory to high memory, giving them virtual addresses from
> the identity mapping. Looks a bit like this:
>
>                         physical addr
>                         +------------+
>                   +---> |  KVA RAM   |
>                   |     +------------+
>                   |     \/\/\/\/\/\/\/
>                   |     /\/\/\/\/\/\/\
>   virtual addr    |     |  highmem   |
>  +------------+   |     |------------|
>  |            | ------> |            |
>  +------------+   |     +------------+
>  |  remap va  | --+     |   KVA PG   |  (unused)
>  +------------+         +------------+
>  |            | ------> | RAM bottom |
>  +------------+         +------------+
>
> This breaks a very basic assumption that crash makes about low-memory
> virtual addresses.
Hmmm, yeah, I was also unaware of this, and I'm not entirely clear based
upon your explanation. What do "KVA PG" and "KVA RAM" mean exactly? And
do just the pgdat structures (which I know can be huge) get moved from low
to high physical memory (per-node perhaps), and then remapped with virtual
addresses from the identity-mapped region?
Well, the concept dates back to Martin Bligh's patch in 2002 which added this
for NUMA-Q. My understanding is that "KVA PG" refers to the kernel virtual
addresses used to access the pgdat array, as well as to the physical memory
that would correspond to these virtual addresses if they were identity-mapped.
That physical memory is then inaccessible.
"KVA RAM", on the other hand, is where the pgdat structures are actually
stored. Note that there is no "moving" of the structures, because the
remapping happens when the memory nodes are initialized, i.e. before anything
accesses them.
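To illustrate the translation rule, here is a stand-alone toy (not crash code
and not kernel code): an address inside a node's remap window is backed by
"KVA RAM" in high memory, while everything else still obeys the identity
mapping. The array names only mimic the node_remap_* variables from
arch/x86/mm/numa_32.c, and all the values in main() are made up (I reused the
f4a012b4 address from the makedumpfile output below, but the physical side is
invented).

/* Illustration only -- neither crash nor kernel code. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PAGE_OFFSET  0xc0000000UL   /* typical 32-bit lowmem base */
#define MAX_NUMNODES 8

static unsigned long node_remap_start_vaddr[MAX_NUMNODES];
static unsigned long node_remap_start_pfn[MAX_NUMNODES];
static unsigned long node_remap_bytes[MAX_NUMNODES];
static int numnodes;

static uint64_t example_kvtop(unsigned long vaddr)
{
    int nid;

    for (nid = 0; nid < numnodes; nid++) {
        unsigned long start = node_remap_start_vaddr[nid];

        if (vaddr >= start && vaddr - start < node_remap_bytes[nid])
            /* remapped: the backing RAM lives in high memory */
            return ((uint64_t)node_remap_start_pfn[nid] << PAGE_SHIFT)
                   + (vaddr - start);
    }

    /* ordinary lowmem address: the identity mapping still holds */
    return (uint64_t)vaddr - PAGE_OFFSET;
}

int main(void)
{
    /* one made-up node: remap window at 0xf4a00000, backed by RAM
     * at physical 0x120000000 (above 4G, reachable with PAE) */
    numnodes = 1;
    node_remap_start_vaddr[0] = 0xf4a00000UL;
    node_remap_start_pfn[0]   = 0x120000UL;
    node_remap_bytes[0]       = 4UL << 20;

    printf("0xf4a012b4 -> %#llx\n",
           (unsigned long long)example_kvtop(0xf4a012b4UL));
    printf("0xc0001000 -> %#llx\n",
           (unsigned long long)example_kvtop(0xc0001000UL));
    return 0;
}

The second address gives the usual vaddr minus PAGE_OFFSET result; the first
one does not, and that is exactly the assumption that breaks.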
Regarding your second question, anything can theoretically call alloc_remap()
to allocate memory from this region, but nothing else does. Looking at
init_alloc_remap(), the size of the pool is always calculated as the size of
the pgdat array plus the struct pglist_data itself, rounded up to a multiple
of 2MB (so that large pages can be used), so there is no spare room for
anything else anyway.
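For reference, the sizing arithmetic boils down to something like this. It is
a stand-alone toy written from memory (the real code uses
node_memmap_size_bytes() and sizeof(pg_data_t), as far as I remember), and all
the sizes below are made up for the example:

/* Toy version of the pool sizing in init_alloc_remap(), written from
 * memory -- not a verbatim copy of arch/x86/mm/numa_32.c. */
#include <stdio.h>

#define PAGE_SIZE        4096UL
#define LARGE_PAGE_BYTES (2UL * 1024 * 1024)
#define ALIGN(x, a)      (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
    unsigned long node_pages = 1UL << 20;   /* 4 GB node, made up           */
    unsigned long page_desc  = 32;          /* sizeof(struct page), made up */
    unsigned long pgdat_size = 2560;        /* sizeof(pg_data_t), made up   */
    unsigned long size;

    size  = node_pages * page_desc;         /* the node's struct page array */
    size += ALIGN(pgdat_size, PAGE_SIZE);   /* plus the pgdat itself        */
    size  = ALIGN(size, LARGE_PAGE_BYTES);  /* round up to 2 MB             */

    printf("remap pool: %lu MiB\n", size >> 20);
    return 0;
}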
Anyway, I trust you know what you're doing...
Thank you for the trust.
> The attached patch fixes the issue for me, but may not be the cleanest
> method to handle these mappings.
Anyway, what I can't wrap my head around is that the initialization
sequence is done by the first call to x86_kvtop_PAE(), which calls
x86_kvtop_remap(), which calls initialize_remap(), which calls readmem(),
which calls x86_kvtop_PAE(), starting the whole thing over again. How
does that recursion work? Would it be possible to call initialize_remap()
earlier on instead of doing it upon the first kvtop() call?
Agreed. My thinking was that each node has its own remap region, so I wanted
to know the number of nodes first. Since I didn't want to duplicate the
heuristics used to determine the number of nodes, I couldn't initialize before
vm_init(). Then again, the remap mapping is already accessed before vm_init()
finishes, hence the lazy initialization on the first kvtop() call.
I can see now that this is unnecessarily complicated, because the node_remap_*
variables are static arrays of MAX_NUMNODES elements, so I can get their size
from the debuginfo at POST_GDB init and initialize a machine-specific data
structure with it. I'll post another patch tomorrow.
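Roughly what I have in mind, as an untested sketch against crash's helpers in
defs.h; the remap_info structure and its field names are made up for
illustration, and error handling is omitted:

/* Untested sketch of the POST_GDB initialization; remap_info and its
 * field names are made up for illustration. */
struct remap_info {
    int nodes;              /* MAX_NUMNODES, taken from the debuginfo */
    ulong *start_vaddr;     /* contents of node_remap_start_vaddr[]   */
    ulong *start_pfn;       /* contents of node_remap_start_pfn[]     */
};

static void x86_init_remap(struct remap_info *ri)
{
    if (!symbol_exists("node_remap_start_vaddr"))
        return;             /* kernel without the remap allocator */

    /* the arrays are static and sized MAX_NUMNODES, so their length
     * comes straight from the debuginfo -- no heuristics needed */
    ri->nodes = get_array_length("node_remap_start_vaddr", NULL, 0);

    ri->start_vaddr = (ulong *)GETBUF(ri->nodes * sizeof(ulong));
    ri->start_pfn   = (ulong *)GETBUF(ri->nodes * sizeof(ulong));

    /* the arrays themselves live in ordinary lowmem, so a plain
     * readmem() works here without any remap handling */
    readmem(symbol_value("node_remap_start_vaddr"), KVADDR,
            ri->start_vaddr, ri->nodes * sizeof(ulong),
            "node_remap_start_vaddr", FAULT_ON_ERROR);
    readmem(symbol_value("node_remap_start_pfn"), KVADDR,
            ri->start_pfn, ri->nodes * sizeof(ulong),
            "node_remap_start_pfn", FAULT_ON_ERROR);
}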
Thanks for the hint!
Petr Tesarik
SUSE Linux
> Ken'ichi Ohmichi, please note that makedumpfile is also affected by this
> deficiency. On my test system, it will fail to produce any output if I
> set dump level to anything greater than zero:
>
> makedumpfile -c -d 31 -x vmlinux-3.0.13-0.5-pae.debug vmcore kdump.31
> readmem: Can't convert a physical address(34a012b4) to offset.
> readmem: type_addr: 0, addr:f4a012b4, size:4
> get_mm_discontigmem: Can't get node_start_pfn.
>
> makedumpfile Failed.
>
> However, fixing this for makedumpfile is harder, and it will most likely
> require a few more lines in VMCOREINFO, because debug symbols may not be
> available at dump time, and I can't see any alternative method to locate
> the remapped regions.
>
> Regards,
> Petr Tesarik
> SUSE Linux
--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility