Hi Dave,
On 2/20/2018 11:32 AM, Dave Anderson wrote:
...
>>>> Another suggestion/question -- if is_page_ptr() is called with a NULL
>>>> phys argument (as is done most of the time), could it skip the
>>>> "if IS_SPARSEMEM()" section at the top, and still utilize the part at
>>>> the bottom, where it walks through the vt->node_table[x] array? I'm
>>>> not sure about the "ppend" calculation though -- even if there are
>>>> holes in the node's address space, is it still a contiguous chunk of
>>>> page structure addresses per-node?
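For readers following along, the per-node walk being discussed is shaped
roughly like this -- a paraphrase from memory of the bottom of
is_page_ptr(), not the exact crash source:

    int n;
    ulong ppstart, ppend;
    struct node_table *nt;

    for (n = 0; n < vt->numnodes; n++) {
            nt = &vt->node_table[n];
            ppstart = nt->mem_map;
            /* assumes the node's page structs are virtually contiguous */
            ppend = ppstart + (nt->size * SIZE(page));
            if ((addr >= ppstart) && (addr < ppend))
                    return TRUE;
    }
    return FALSE;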
>>>
>>> I'm still investigating and not sure yet, but I think that the fact that
>>> SPARSEMEM uses mem_section instead of node_mem_map means page structures
>>> could be non-contiguous per-node, depending on the architecture or
>>> configuration.
>>>
>>> typedef struct pglist_data {
>>> ...
>>> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
>>> struct page *node_mem_map;
>>>
>>> I'll continue to check it.
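For reference, the reason is that classic SPARSEMEM translates a pfn through
its mem_section, so each section can carry its own mem_map allocation.
Roughly, from memory of include/asm-generic/memory_model.h (not an exact
quote):

    /* classic SPARSEMEM (no VMEMMAP): each mem_section has its own
     * encoded mem_map, so page structs from different sections need
     * not be virtually contiguous */
    #define __pfn_to_page(pfn)                                       \
    ({      unsigned long __pfn = (pfn);                             \
            struct mem_section *__sec = __pfn_to_section(__pfn);     \
            __section_mem_map_addr(__sec) + __pfn;                   \
    })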
>>
>> You are right, but in the case where pglist_data.node_mem_map does *not*
>> exist, the crash utility initializes each vt->node_table[node].mem_map
>> with the node's starting mem_map address by using the return value from
>> phys_to_page() on the node's starting physical address -- which uses the
>> sparsemem functions.
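A condensed sketch of that initialization path (paraphrased, not the exact
crash source; the VALID_MEMBER() probe and offset-table member name are my
assumption about how crash detects the optional structure member):

    /* without pglist_data.node_mem_map, derive the node's starting
     * mem_map from its starting physical address (sparsemem-aware) */
    if (!VALID_MEMBER(pglist_data_node_mem_map)) {
            ulong mem_map = 0;
            phys_to_page(nt->start_paddr, &mem_map);
            nt->mem_map = mem_map;
    }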
>>
>> The question is whether the current "ppend" calculation is correct for
>> the last physical page in a node. If it is not correct, then perhaps a
>> "mem_map_end" value can be added to the node_table structure, initialized
>> by using phys_to_page() to get the page address of the last physical
>> address in the node. And then in that case, the question is whether the
>> mem_map range of virtual addresses is contiguous -- even if there are
>> holes in the mem_map virtual address range.
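To make the idea concrete, here is a minimal sketch of the proposed field
and its initialization -- the "mem_map_end" member is hypothetical (it does
not exist in node_table today), and the rounding details are illustrative
only:

    /* hypothetical node_table addition: one past the node's last
     * page structure, derived from its last physical address */
    physaddr_t last_phys;
    ulong last_page;

    last_phys = nt->start_paddr + ((physaddr_t)(nt->size - 1) * PAGESIZE());
    if (phys_to_page(last_phys, &last_page))
            nt->mem_map_end = last_page + SIZE(page);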
>
> "node_size" is set to pglist_data.node_spanned_pages, which includes
holes.
> So I think that if VMEMMAP, which a page address is linear against its pfn,
> the current "ppend" calculation is correct for the last page in a node.
> But if not VMEMMAP, since there is no guarantee of the linearity, the
> calculation could be incorrect.
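To illustrate the distinction: with VMEMMAP the kernel computes a page
address directly from the pfn, so the mapping is linear by construction
(roughly, from memory of the kernel's definition):

    /* VMEMMAP: page struct address is linear in the pfn */
    #define __pfn_to_page(pfn)  (vmemmap + (pfn))

    /* so for a node of node_spanned_pages pfns starting at mem_map,
     *   ppend = mem_map + node_spanned_pages * sizeof(struct page)
     * holds even across physical holes -- which the per-section
     * mem_section mapping does not guarantee */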
>
> I found an example with RHEL5:
>
> crash> help -o
> ...
> size_table:
> page: 56
> ...
> crash> kmem -n
> NODE  SIZE    PGLIST_DATA       BOOTMEM_DATA      NODE_ZONES
>   0   524279  ffff810000014000  ffffffff804e1900  ffff810000014000
>                                                   ffff810000014b00
>                                                   ffff810000015600
>                                                   ffff810000016100
>       MEM_MAP           START_PADDR  START_MAPNR
>       ffff8100007da000  0            0
>
> ZONE  NAME     SIZE    MEM_MAP           START_PADDR  START_MAPNR
>   0   DMA      4096    ffff8100007da000  0            0
>   1   DMA32    520183  ffff810000812000  1000000      4096
>   2   Normal   0       0                 0            0
>   3   HighMem  0       0                 0            0
>
> -------------------------------------------------------------------
>
> NR SECTION CODED_MEM_MAP MEM_MAP PFN
> 0 ffff810009000000 ffff8100007da000 ffff8100007da000 0
> 1 ffff810009000008 ffff8100007da000 ffff81000099a000 32768
> 2 ffff810009000010 ffff8100007da000 ffff810000b5a000 65536
> 3 ffff810009000018 ffff8100007da000 ffff810000d1a000 98304 <= there is a
> 4 ffff810009000020 ffff810008901000 ffff810009001000 131072 <= mem_map gap.
> 5 ffff810009000028 ffff810008901000 ffff8100091c1000 163840
> :
> 14 ffff810009000070 ffff810008901000 ffff81000a181000 458752
> 15 ffff810009000078 ffff810008901000 ffff81000a341000 491520
> crash>
>
> In this case, the "ppend" will be
>
> 0xffff8100007da000 + (524279 * 56)
> = 0xffff8100023d9e08
>
> but it looks like the actual value is around 0xffff81000a501000.
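(For reference, that figure follows from the last section: the end of
NR=15's mem_map is 0xffff81000a341000 + (32768 * 56) = 0xffff81000a341000
+ 0x1c0000 = 0xffff81000a501000, which is why the actual node end is around
that value.)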
> Right, I understand that the current "ppend" calculation wouldn't work.
> And also, we can see the gap between NR=3 and 4. This means that if the
> correct "mem_map_end" is added to the node_table structure, it would be
> not enough to check whether an address is a page structure.
> Why? Wouldn't it still give us an ascending range of page structure
> addresses on a per-node basis? (even if there was a physical and/or
> virtual memory hole?) AFAICT, for each section NR, the MEM_MAP and PFN
> values always increment. Sorry if I misunderstood something...
First, I assume that we are talking about the case of kernels with SPARSEMEM,
using the vt->numnodes loop after skipping the IS_SPARSEMEM() section.
The "mem_map_end" I mean here is the page address of the last physical
address in the node, and the example system has only one node. So I think
that the "kmem -n" output above suggests that the loop could return TRUE
for an incoming "addr" that falls between the end of NR=3 and the start of
NR=4, even though it is not a page address:
 NR      MEM_MAP
  0  +---------+ ffff8100007da000 = nt->mem_map
  :  | pages.. |        :
  2  +---------+ ffff810000b5a000
  3  +---------+ ffff810000d1a000
     +---------+ ffff810000eda000 = ffff810000d1a000 + (32768 * 56)
     |   ???   | <-- for an "addr" here, it could return TRUE.
  4  +---------+ ffff810009001000
  5  +---------+ ffff8100091c1000
  :  | pages.. |        :
 15  +---------+ ffff81000a341000
     +---------+ ffff81000a501000 = nt->mem_map_end
Because of such mem_map holes within a node, I don't think that the
vt->numnodes loop can be utilized as-is for kernels with SPARSEMEM.
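In other words, to be exact the check probably has to remain per-section
rather than per-node. A rough sketch of that shape, using the helpers the
IS_SPARSEMEM() section at the top already relies on (from memory, not the
exact source):

    /* each valid section contributes its own independent range of
     * page structures; anything between ranges is not a page struct */
    ulong nr, sec_addr, coded_mem_map, mem_map, end_mem_map;

    for (nr = 0; nr < NR_MEM_SECTIONS(); nr++) {
            if (!(sec_addr = valid_section_nr(nr)))
                    continue;
            coded_mem_map = section_mem_map_addr(sec_addr);
            mem_map = sparse_decode_mem_map(coded_mem_map, nr);
            end_mem_map = mem_map + (PAGES_PER_SECTION() * SIZE(page));
            if ((addr >= mem_map) && (addr < end_mem_map))
                    return TRUE;
    }
    return FALSE;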
Is this "mem_map_end" different from the one you assumed?
Thanks,
Kazuhito Hagio