----- Original Message -----
 Hi Dave,
 
 On 2/16/2018 4:18 PM, Dave Anderson wrote:
 ...
 >>>> OK, I understand your point.  But what concerns me is that the function's
 >>>> purpose is to absolutely identify whether the incoming page structure
 >>>> address is a correct page structure address.  But if an invalid address
 >>>> gets passed into is_page_ptr(), your patch would take the invalid address,
 >>>> calculate an invalid "nr", and continue from there, right?
 >>
 >> Yes.  If an invalid "nr" refers to a section that does not exist,
 >> valid_section_nr() would return 0.  Even if it happens to refer to an
 >> existing section, the invalid "addr" is either not between mem_map and
 >> end_mem_map, or not page-aligned -- because if it satisfied both, it would
 >> be a page structure address.
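 >>
 >> In other words, the checks inside the section loop are roughly like the
 >> following (just a sketch, using the variable names in is_page_ptr()):
 >>
 >>   if ((addr >= mem_map) && (addr < end_mem_map) &&
 >>       (((addr - mem_map) % SIZE(page)) == 0)) {
 >>           /* addr is a page structure address within this section */
 >>   }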
 >>
 >> Also, without this patch, when an invalid address comes in, the loop could
 >> try many invalid "nr"s less than NR_MEM_SECTIONS().
 >>
 >> I hope this answers your concern..
 >>
 >>>
 >>> Another suggestion/question -- if is_page_ptr() is called with a NULL phys
 >>> argument (as is done most of the time), could it skip the "if IS_SPARSEMEM()"
 >>> section at the top, and still utilize the part at the bottom, where it walks
 >>> through the vt->node_table[x] array?  I'm not sure about the "ppend"
 >>> calculation though -- even if there are holes in the node's address space,
 >>> is it still a contiguous chunk of page structure addresses per-node?
 >>
 >> I'm still investigating and not sure yet, but I think the fact that SPARSEMEM
 >> uses mem_section instead of node_mem_map means that page structures could be
 >> non-contiguous per-node, depending on the architecture or configuration.
 >>
 >>   typedef struct pglist_data {
 >>   ...
 >>   #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
 >>           struct page *node_mem_map;
 >>
 >> I'll continue to check it.
 > 
 > You are right, but in the case where pglist_data.node_mem_map does *not*
 > exist, the crash utility initializes each vt->node_table[node].mem_map with
 > the node's starting mem_map address by using the return value from
 > phys_to_page() of the node's starting physical address -- which uses the
 > sparsemem functions.
 >  
 > The question is whether the current "ppend" calculation is correct for the
 > last physical page in a node.  If it is not correct, then perhaps a
 > "mem_map_end" value can be added to the node_table structure, initialized
 > by using phys_to_page() to get the page address of the last physical address
 > in the node.  And then in that case, the question is whether the mem_map
 > range of virtual addresses is contiguous -- even if there are holes in the
 > mem_map virtual address range.
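 > 
 > To illustrate the idea (a rough sketch only -- "mem_map_end" would be a new
 > node_table member, and the exact phys_to_page() usage would need checking):
 > 
 >     /* nt = &vt->node_table[node] */
 >     physaddr_t end_paddr;
 >     ulong mem_map_end;
 > 
 >     end_paddr = nt->start_paddr + ((physaddr_t)(nt->size - 1) * PAGESIZE());
 >     if (phys_to_page(end_paddr, &mem_map_end))
 >             nt->mem_map_end = mem_map_end;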
 
 "node_size" is set to pglist_data.node_spanned_pages, which includes holes.
 So I think that if VMEMMAP, which a page address is linear against its pfn,
 the current "ppend" calculation is correct for the last page in a node.
 But if not VMEMMAP, since there is no guarantee of the linearity, the
 calculation could be incorrect.
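 
 That is, as I read it, the current calculation is roughly
 
   ppend = vt->node_table[node].mem_map
             + (vt->node_table[node].size * SIZE(page));
 
 (just my reading of it, not necessarily the exact expression in the source).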
 
 I found an example with RHEL5:
 
 crash> help -o
 ...
                     size_table:
                           page: 56
 ...
 crash> kmem -n
 NODE    SIZE      PGLIST_DATA       BOOTMEM_DATA       NODE_ZONES
   0    524279   ffff810000014000  ffffffff804e1900  ffff810000014000
                                                     ffff810000014b00
                                                     ffff810000015600
                                                     ffff810000016100
     MEM_MAP       START_PADDR  START_MAPNR
 ffff8100007da000       0            0
 
 ZONE  NAME         SIZE       MEM_MAP      START_PADDR  START_MAPNR
   0   DMA          4096  ffff8100007da000            0            0
   1   DMA32      520183  ffff810000812000      1000000         4096
   2   Normal          0                 0            0            0
   3   HighMem         0                 0            0            0
 
 -------------------------------------------------------------------
 
 NR      SECTION        CODED_MEM_MAP        MEM_MAP       PFN
  0  ffff810009000000  ffff8100007da000  ffff8100007da000  0
  1  ffff810009000008  ffff8100007da000  ffff81000099a000  32768
  2  ffff810009000010  ffff8100007da000  ffff810000b5a000  65536
  3  ffff810009000018  ffff8100007da000  ffff810000d1a000  98304   <= there is a
  4  ffff810009000020  ffff810008901000  ffff810009001000  131072  <= mem_map gap.
  5  ffff810009000028  ffff810008901000  ffff8100091c1000  163840
  :
 14  ffff810009000070  ffff810008901000  ffff81000a181000  458752
 15  ffff810009000078  ffff810008901000  ffff81000a341000  491520
 crash>
 
 In this case, the "ppend" will be
 
   0xffff8100007da000 + (524279 * 56)
   = 0xffff8100023d9e08
 
 but it looks like the actual value is around 0xffff81000a501000. 

Right, I understand that the current "ppend" calculation wouldn't work.

 And also, we can see the gap between NR=3 and 4.  This means that even if the
 correct "mem_map_end" is added to the node_table structure, it would not be
 enough to check whether an address is a page structure.

Why?  Wouldn't it still give us an ascending range of page structure addresses
on a per-node basis?  (even if there was a physical and/or virtual memory hole?) 
AFAICT, for each section NR, the MEM_MAP and PFN values always increment.
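
For example, something like this sketch (where "mem_map_end" is the proposed
new node_table member, and "nt" is a struct node_table pointer):

    for (n = 0; n < vt->numnodes; n++) {
            nt = &vt->node_table[n];
            if ((addr >= nt->mem_map) && (addr <= nt->mem_map_end))
                    return TRUE;    /* addr falls in this node's mem_map range */
    }
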
Dave
 
 Thanks,
 Kazuhito Hagio
 
 > 
 > Thanks,
 >   Dave
 > 
 > 
 > 
 >>
 >> Thanks,
 >> Kazuhito Hagio
 >>
 >>>
 >>>>
 >>>> Dave
 >>>>
 >>>>>
 >>>>>>
 >>>>>> There is really no compelling reason that count_partial() absolutely
 >>>>>> *must* use is_page_ptr(), and so I'm thinking that perhaps you could
 >>>>>> come up with a less heavy-handed method for simply testing whether a
 >>>>>> page.lru entry points to another vmemmap'd page.  Something along the
 >>>>>> lines of adding this for vmemmap-enabled kernels:
 >>>>>>
 >>>>>>   #define IN_VMEMMAP_RANGE(page) ((page >= VMEMMAP_VADDR) && (page <= VMEMMAP_END))
 >>>>>>
 >>>>>> and then have count_partial() replace the is_page_ptr() call with
 >>>>>> another slub function that does something like this for
 >>>>>> vmemmap-enabled kernels:
 >>>>>>
 >>>>>>    (IN_VMEMMAP_RANGE(next) && accessible(next))
 >>>>>>
 >>>>>> Or instead of accessible(), it could read "next" as a list_head with
 >>>>>> RETURN_ON_ERROR, and verify that next->prev points back to the current
 >>>>>> list_head.
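 >>>>>>
 >>>>>> Roughly, as a sketch (where "next" and "curr" stand for the next and
 >>>>>> current list_head addresses in the walk -- just illustrative names):
 >>>>>>
 >>>>>>    struct kernel_list_head lh;
 >>>>>>
 >>>>>>    if (IN_VMEMMAP_RANGE(next) &&
 >>>>>>        readmem(next, KVADDR, &lh, sizeof(lh), "page.lru",
 >>>>>>                RETURN_ON_ERROR|QUIET) &&
 >>>>>>        ((ulong)lh.prev == curr)) {
 >>>>>>            /* "next" looks like another vmemmap'd page.lru entry */
 >>>>>>    }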
 >>>>>>
 >>>>>> Non-vmemmap-enabled kernels could still use is_page_ptr().
 >>>>>>
 >>>>>> What do you think of doing something like that?
 >>>>>
 >>>>> Given the possible compatibility issues you mentioned, I think that the
 >>>>> way you suggested might well be enough for now.  I'll try a method like
 >>>>> the above.
 >>>>>
 >>>>> Thanks,
 >>>>> Kazuhito Hagio
 >>>>>
 >>>>>>
 >>>>>> Dave
 >>>>>>
 >>>>>>
 >>>>>>       
 >>>>>>
 >>>>>> ----- Original Message -----
 >>>>>>> Hi,
 >>>>>>>
 >>>>>>> The "kmem -[sS]" commands can take several minutes
to complete with
 >>>>>>> the following conditions:
 >>>>>>>   * The system has a lot of memory sections with
CONFIG_SPARSEMEM.
 >>>>>>>   * The kernel uses SLUB and it has a very long partial slab
list.
 >>>>>>>
 >>>>>>>   crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
 >>>>>>>   10:18:34 CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
 >>>>>>>   10:19:41 ffff88017fc78a00 dentry                   192    9038949  10045728 239184     8k
 >>>>>>>   crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
 >>>>>>>   334
 >>>>>>>
 >>>>>>> One of the causes is that is_page_ptr() in count_partial() checks if
 >>>>>>> a given slub page address is a page struct by searching all memory
 >>>>>>> sections linearly for the one which includes it.
 >>>>>>>
 >>>>>>>         nr_mem_sections = NR_MEM_SECTIONS();
 >>>>>>>         for (nr = 0; nr < nr_mem_sections ; nr++) {
 >>>>>>>                 if ((sec_addr = valid_section_nr(nr))) {
 >>>>>>>                         ...
 >>>>>>>
 >>>>>>> With CONFIG_SPARSEMEM{_VMEMMAP}, we can calculate the memory section
 >>>>>>> which includes a page struct from its page.flags, or from its address
 >>>>>>> and VMEMMAP_VADDR.  With this patch doing so, the amount of computation
 >>>>>>> can be significantly reduced in that case.
 >>>>>>>
 >>>>>>>   crash> kmem -s dentry | awk '{print strftime("%T"), $0}'
 >>>>>>>   10:34:55 CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
 >>>>>>>   10:34:55 ffff88017fc78a00 dentry                   192    9038949  10045728 239184     8k
 >>>>>>>   crash> kmem -S dentry | bash -c 'cat >/dev/null ; echo $SECONDS'
 >>>>>>>   2
 >>>>>>>
 >>>>>>> This patch uses VMEMMAP_VADDR.  It is not defined on PPC64, but it
 >>>>>>> looks like PPC64 supports the VMEMMAP flag and
 >>>>>>> machdep->machspec->vmemmap_base is the equivalent, so this patch also
 >>>>>>> defines it for PPC64.  This might need some help from PPC folks.
 >>>>>>>
 >>>>>>> Signed-off-by: Kazuhito Hagio <k-hagio(a)ab.jp.nec.com>
 >>>>>>> ---
 >>>>>>>  defs.h   |  2 ++
 >>>>>>>  memory.c | 15 +++++++++++++++
 >>>>>>>  2 files changed, 17 insertions(+)
 >>>>>>>
 >>>>>>> diff --git a/defs.h b/defs.h
 >>>>>>> index aa17792..84e68ca 100644
 >>>>>>> --- a/defs.h
 >>>>>>> +++ b/defs.h
 >>>>>>> @@ -3861,6 +3861,8 @@ struct efi_memory_desc_t {
 >>>>>>>  #define IS_VMALLOC_ADDR(X)      machdep->machspec->is_vmaddr(X)
 >>>>>>>  #define KERNELBASE      machdep->pageoffset
 >>>>>>>  
 >>>>>>> +#define VMEMMAP_VADDR   (machdep->machspec->vmemmap_base)
 >>>>>>> +
 >>>>>>>  #define PGDIR_SHIFT     (machdep->pageshift + (machdep->pageshift -3) + (machdep->pageshift - 2))
 >>>>>>>  #define PMD_SHIFT       (machdep->pageshift + (machdep->pageshift - 3))
 >>>>>>>  
 >>>>>>> diff --git a/memory.c b/memory.c
 >>>>>>> index 0df8ecc..0696763 100644
 >>>>>>> --- a/memory.c
 >>>>>>> +++ b/memory.c
 >>>>>>> @@ -13348,10 +13348,25 @@ is_page_ptr(ulong addr, physaddr_t *phys)
 >>>>>>>  	ulong nr_mem_sections;
 >>>>>>>  	ulong coded_mem_map, mem_map, end_mem_map;
 >>>>>>>  	physaddr_t section_paddr;
 >>>>>>> +#ifdef VMEMMAP_VADDR
 >>>>>>> +	ulong flags;
 >>>>>>> +#endif
 >>>>>>>  
 >>>>>>>  	if (IS_SPARSEMEM()) {
 >>>>>>>  		nr_mem_sections = NR_MEM_SECTIONS();
 >>>>>>> +#ifdef VMEMMAP_VADDR
 >>>>>>> +		nr = nr_mem_sections;
 >>>>>>> +		if (machdep->flags & VMEMMAP)
 >>>>>>> +			nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
 >>>>>>> +		else if (readmem(addr + OFFSET(page_flags), KVADDR, &flags,
 >>>>>>> +			sizeof(ulong), "page.flags", RETURN_ON_ERROR|QUIET))
 >>>>>>> +			nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT())
 >>>>>>> +				& ((1UL << SECTIONS_SHIFT()) - 1));
 >>>>>>> +
 >>>>>>> +		if (nr < nr_mem_sections) {
 >>>>>>> +#else
 >>>>>>>  	        for (nr = 0; nr < nr_mem_sections ; nr++) {
 >>>>>>> +#endif
 >>>>>>>  	                if ((sec_addr = valid_section_nr(nr))) {
 >>>>>>>  	                        coded_mem_map = section_mem_map_addr(sec_addr);
 >>>>>>>  	                        mem_map = sparse_decode_mem_map(coded_mem_map, nr);
 >>>>>>> --
 >>>>>>> 1.8.3.1
 
 --
 Crash-utility mailing list
 Crash-utility(a)redhat.com
 
https://www.redhat.com/mailman/listinfo/crash-utility