Hi Dave,
On 2/16/2018 4:18 PM, Dave Anderson wrote:
...
>>> OK, I understand your point. But what concerns me is that the function's
>>> purpose is to absolutely identify whether the incoming page structure
>>> address is a correct page structure address. But if an invalid address
>>> gets passed into is_page_ptr(), your patch would take the invalid
>>> address, calculate an invalid "nr", and continue from there, right?
>
> Yes, if an invalid "nr" is a number for which no section exists,
> valid_section_nr() will return 0. Even if it happens to be a number for
> which a section exists, the invalid "addr" will not fall between mem_map
> and end_mem_map, or will not be aligned on a page structure boundary --
> because if it were both, it would be a page structure address.
>
> Also, without this patch, when an invalid address comes in, the loop could
> try many invalid "nr"s less than NR_MEM_SECTIONS().
>
> I hope this answers your concern.
>
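> In other words, the remaining checks amount to something like this (a
> sketch, following the existing is_page_ptr() code):
>
>     if ((sec_addr = valid_section_nr(nr))) {  /* 0 if the section does not exist */
>         mem_map = sparse_decode_mem_map(section_mem_map_addr(sec_addr), nr);
>         end_mem_map = mem_map + (PAGES_PER_SECTION() * SIZE(page));
>
>         /* a page structure address must be in range and aligned */
>         if ((addr >= mem_map) && (addr < end_mem_map) &&
>             !((addr - mem_map) % SIZE(page)))
>                 return TRUE;
>     }
>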
>>
>> Another suggestion/question -- if is_page_ptr() is called with a NULL
>> phys argument (as is done most of the time), could it skip the
>> "if IS_SPARSEMEM()" section at the top, and still utilize the part at
>> the bottom, where it walks through the vt->node_table[x] array? I'm not
>> sure about the "ppend" calculation though -- even if there are holes in
>> the node's address space, is it still a contiguous chunk of page
>> structure addresses per-node?
>
> I'm still investigating and am not sure yet, but I think the fact that
> SPARSEMEM uses mem_section instead of node_mem_map means that page
> structures could be non-contiguous per-node, depending on the architecture
> or configuration.
>
> typedef struct pglist_data {
> ...
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
>
> I'll continue to check it.
> You are right, but in the case where pglist_data.node_mem_map does *not*
> exist, the crash utility initializes each vt->node_table[node].mem_map
> with the node's starting mem_map address by using the return value from
> phys_to_page() of the node's starting physical address -- which uses the
> sparsemem functions.
>
> The question is whether the current "ppend" calculation is correct for the
> last physical page in a node. If it is not correct, then perhaps a
> "mem_map_end" value can be added to the node_table structure, initialized
> by using phys_to_page() to get the page address of the last physical
> address in the node. And then in that case, the question is whether the
> mem_map range of virtual addresses is contiguous -- even if there are
> holes in the mem_map virtual address range.
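
If I understand the idea correctly, the initialization would look something
like this (just a sketch; "mem_map_end" is the proposed new field, and "nt"
a pointer to the node's struct node_table):

	physaddr_t last_paddr;
	ulong mem_map_end;

	/* physical address of the node's last page */
	last_paddr = nt->start_paddr +
	    ((physaddr_t)(nt->size - 1) << machdep->pageshift);

	/* page structure address of that last physical page */
	if (phys_to_page(last_paddr, &mem_map_end))
		nt->mem_map_end = mem_map_end + SIZE(page);
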
"node_size" is set to pglist_data.node_spanned_pages, which includes holes.
So I think that if VMEMMAP, which a page address is linear against its pfn,
the current "ppend" calculation is correct for the last page in a node.
But if not VMEMMAP, since there is no guarantee of the linearity, the
calculation could be incorrect.
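
For example, with VMEMMAP the address of a page structure is a fixed
function of its pfn, so the end address follows directly from the start
(a sketch using crash's macros; "pstart" and "node_size" are the node's
first page struct and spanned page count):

	ulong page_addr, ppend;

	/* under VMEMMAP, page struct addresses are linear in the pfn */
	page_addr = VMEMMAP_VADDR + (pfn * SIZE(page));

	/* so one past the node's last page struct is simply */
	ppend = pstart + (node_size * SIZE(page));
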
I found an example with RHEL5:
crash> help -o
...
size_table:
page: 56
...
crash> kmem -n
NODE    SIZE      PGLIST_DATA       BOOTMEM_DATA      NODE_ZONES
  0    524279   ffff810000014000  ffffffff804e1900  ffff810000014000
                                                    ffff810000014b00
                                                    ffff810000015600
                                                    ffff810000016100
     MEM_MAP      START_PADDR  START_MAPNR
ffff8100007da000       0            0
ZONE  NAME      SIZE       MEM_MAP       START_PADDR  START_MAPNR
  0   DMA        4096  ffff8100007da000       0            0
  1   DMA32    520183  ffff810000812000    1000000       4096
  2   Normal        0         0                0            0
  3   HighMem       0         0                0            0
-------------------------------------------------------------------
 NR      SECTION        CODED_MEM_MAP        MEM_MAP         PFN
  0  ffff810009000000  ffff8100007da000  ffff8100007da000       0
  1  ffff810009000008  ffff8100007da000  ffff81000099a000   32768
  2  ffff810009000010  ffff8100007da000  ffff810000b5a000   65536
  3  ffff810009000018  ffff8100007da000  ffff810000d1a000   98304  <= there is a
  4  ffff810009000020  ffff810008901000  ffff810009001000  131072  <= mem_map gap.
  5  ffff810009000028  ffff810008901000  ffff8100091c1000  163840
  :
 14  ffff810009000070  ffff810008901000  ffff81000a181000  458752
 15  ffff810009000078  ffff810008901000  ffff81000a341000  491520
crash>
In this case, the "ppend" will be

	0xffff8100007da000 + (524279 * 56) = 0xffff8100023d9e08

but it looks like the actual value is around 0xffff81000a501000 (the
MEM_MAP of section NR=15 plus 32768 * 56 bytes).
Also, we can see the gap between NR=3 and NR=4. This means that even if a
correct "mem_map_end" were added to the node_table structure, it would not
be enough to check whether an address is a page structure.
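
For example, an address that falls into that gap would still pass a simple
range check (a sketch, using the values from the output above):

	ulong addr = 0xffff810001000000;  /* inside the gap between NR=3 and NR=4 */

	if ((addr >= 0xffff8100007da000) && (addr < 0xffff81000a501000))
		/* the range check passes, but addr is not a page structure */
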
Thanks,
Kazuhito Hagio
> Thanks,
> Dave
>
> Thanks,
> Kazuhito Hagio
>
>>
>>>
>>> Dave
>>>
>>>>
>>>>>
>>>>> There is really no compelling reason that count_partial() absolutely
>>>>> *must* use is_page_ptr(), and so I'm thinking that perhaps you could
>>>>> come up with a less heavy-handed method for simply testing whether a
>>>>> page.lru entry points to another vmemmap'd page. Something along the
>>>>> lines of adding this for vmemmap-enabled kernels:
>>>>>
>>>>> #define IN_VMEMMAP_RANGE(page) ((page >= VMEMMAP_VADDR) && (page <= VMEMMAP_END))
>>>>>
>>>>> and then have count_partial() replace the is_page_ptr() call with
>>>>> another slub function that does something like this for
>>>>> vmemmap-enabled kernels:
>>>>>
>>>>> (IN_VMEMMAP_RANGE(next) && accessible(next))
>>>>>
>>>>> Or instead of accessible(), it could read "next" as a list_head with
>>>>> RETURN_ON_ERROR, and verify that next->prev points back to the
>>>>> current list_head.
>>>>>
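>>>>> For example, roughly (untested, just a sketch, with "curr" being the
>>>>> address of the current list_head):
>>>>>
>>>>>     ulong lh[2];    /* struct list_head: next, prev */
>>>>>
>>>>>     if (IN_VMEMMAP_RANGE(next) &&
>>>>>         readmem(next, KVADDR, lh, sizeof(lh), "page.lru list_head",
>>>>>             RETURN_ON_ERROR|QUIET) &&
>>>>>         (lh[1] == curr))
>>>>>             /* next->prev points back -- next looks valid */
>>>>>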
>>>>> Non-vmemmap-enabled kernels could still use is_page_ptr().
>>>>>
>>>>> What do you think of doing something like that?
>>>>
>>>> Given the possible compatibility issues you mentioned, I think that the
>>>> way you suggested might well be enough for now. I'll try a method like
>>>> the above.
>>>>
>>>> Thanks,
>>>> Kazuhito Hagio
>>>>
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> Hi,
>>>>>>
>>>>>> The "kmem -[sS]" commands can take several minutes to complete under
>>>>>> the following conditions:
>>>>>>  * The system has a lot of memory sections with CONFIG_SPARSEMEM.
>>>>>>  * The kernel uses SLUB and it has a very long partial slab list.
>>>>>>
>>>>>> crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
>>>>>> 10:18:34 CACHE             NAME     OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>>>>>> 10:19:41 ffff88017fc78a00  dentry       192    9038949  10045728 239184     8k
>>>>>> crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
>>>>>> 334
>>>>>>
>>>>>> One of the causes is that is_page_ptr() in count_partial() checks if
>>>>>> a given slub page address is a page struct by searching all memory
>>>>>> sections linearly for the one which includes it.
>>>>>>
>>>>>>     nr_mem_sections = NR_MEM_SECTIONS();
>>>>>>     for (nr = 0; nr < nr_mem_sections ; nr++) {
>>>>>>         if ((sec_addr = valid_section_nr(nr))) {
>>>>>>             ...
>>>>>>
>>>>>> With CONFIG_SPARSEMEM{_VMEMMAP}, we can calculate the memory section
>>>>>> which includes a page struct from its page.flags, or from its address
>>>>>> and VMEMMAP_VADDR. By doing so, this patch reduces the amount of
>>>>>> computation significantly in that case.
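>>>>>>
>>>>>> That is, the section number can be derived directly, roughly as
>>>>>> follows (a restatement of the calculation in the patch below):
>>>>>>
>>>>>>     /* VMEMMAP: page structs are linear from VMEMMAP_VADDR */
>>>>>>     nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
>>>>>>
>>>>>>     /* otherwise: the section number is encoded in the top bits
>>>>>>        of page.flags */
>>>>>>     nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT()))
>>>>>>         & ((1UL << SECTIONS_SHIFT()) - 1);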
>>>>>>
>>>>>> crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
>>>>>> 10:34:55 CACHE             NAME     OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
>>>>>> 10:34:55 ffff88017fc78a00  dentry       192    9038949  10045728 239184     8k
>>>>>> crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
>>>>>> 2
>>>>>>
>>>>>> This patch uses VMEMMAP_VADDR. It is not defined on PPC64, but it
>>>>>> looks like PPC64 supports the VMEMMAP flag and
>>>>>> machdep->machspec->vmemmap_base serves as it, so this patch also
>>>>>> defines it for PPC64. This might need some help from PPC folks.
>>>>>>
>>>>>> Signed-off-by: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
>>>>>> ---
>>>>>>  defs.h   |  2 ++
>>>>>>  memory.c | 15 +++++++++++++++
>>>>>>  2 files changed, 17 insertions(+)
>>>>>>
>>>>>> diff --git a/defs.h b/defs.h
>>>>>> index aa17792..84e68ca 100644
>>>>>> --- a/defs.h
>>>>>> +++ b/defs.h
>>>>>> @@ -3861,6 +3861,8 @@ struct efi_memory_desc_t {
>>>>>>  #define IS_VMALLOC_ADDR(X)	machdep->machspec->is_vmaddr(X)
>>>>>>  #define KERNELBASE	machdep->pageoffset
>>>>>>
>>>>>> +#define VMEMMAP_VADDR	(machdep->machspec->vmemmap_base)
>>>>>> +
>>>>>>  #define PGDIR_SHIFT	(machdep->pageshift + (machdep->pageshift -3) + (machdep->pageshift - 2))
>>>>>>  #define PMD_SHIFT	(machdep->pageshift + (machdep->pageshift - 3))
>>>>>>
>>>>>> diff --git a/memory.c b/memory.c
>>>>>> index 0df8ecc..0696763 100644
>>>>>> --- a/memory.c
>>>>>> +++ b/memory.c
>>>>>> @@ -13348,10 +13348,25 @@ is_page_ptr(ulong addr, physaddr_t *phys)
>>>>>>  	ulong nr_mem_sections;
>>>>>>  	ulong coded_mem_map, mem_map, end_mem_map;
>>>>>>  	physaddr_t section_paddr;
>>>>>> +#ifdef VMEMMAP_VADDR
>>>>>> +	ulong flags;
>>>>>> +#endif
>>>>>>
>>>>>>  	if (IS_SPARSEMEM()) {
>>>>>>  		nr_mem_sections = NR_MEM_SECTIONS();
>>>>>> +#ifdef VMEMMAP_VADDR
>>>>>> +		nr = nr_mem_sections;
>>>>>> +		if (machdep->flags & VMEMMAP)
>>>>>> +			nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
>>>>>> +		else if (readmem(addr + OFFSET(page_flags), KVADDR, &flags,
>>>>>> +		    sizeof(ulong), "page.flags", RETURN_ON_ERROR|QUIET))
>>>>>> +			nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT())
>>>>>> +			    & ((1UL << SECTIONS_SHIFT()) - 1));
>>>>>> +
>>>>>> +		if (nr < nr_mem_sections) {
>>>>>> +#else
>>>>>>  		for (nr = 0; nr < nr_mem_sections ; nr++) {
>>>>>> +#endif
>>>>>>  			if ((sec_addr = valid_section_nr(nr))) {
>>>>>>  				coded_mem_map = section_mem_map_addr(sec_addr);
>>>>>>  				mem_map = sparse_decode_mem_map(coded_mem_map, nr);
>>>>>> --
>>>>>> 1.8.3.1
>>>>>>
--
Crash-utility mailing list
Crash-utility@redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility