----- Original Message -----
 Hi Dave,
 
 On 2/21/2018 4:14 PM, Dave Anderson wrote:
 > 
 > 
 > ----- Original Message -----
 >> Hi Dave,
 >>
 >> Thank you so much for your review!
 >>
 >> On 2/21/2018 11:41 AM, Dave Anderson wrote:
 >>>
 >>> Hi Kazuhito,
 >>>
 >>> Just as a follow-up review of this part of your original patch:
 >>>
 >>>   +#ifdef VMEMMAP_VADDR
 >>>   +               nr = nr_mem_sections;
 >>>   +               if (machdep->flags & VMEMMAP)
 >>>   +                       nr = pfn_to_section_nr((addr - VMEMMAP_VADDR) / SIZE(page));
 >>>   +               else if (readmem(addr + OFFSET(page_flags), KVADDR, &flags,
 >>>   +                       sizeof(ulong), "page.flags", RETURN_ON_ERROR|QUIET))
 >>>   +                       nr = (flags >> (SIZE(page_flags)*8 - SECTIONS_SHIFT())
 >>>   +                               & ((1UL << SECTIONS_SHIFT()) - 1));
 >>>   +
 >>>   +               if (nr < nr_mem_sections) {
 >>>   +#else
 >>>                   for (nr = 0; nr < nr_mem_sections ; nr++) {
 >>>   +#endif
 >>>   
 >>> One of my original concerns was associated with the backwards-compatibility
 >>> of the non-VMEMMAP page->flags usage, primarily because it has changed
 >>> over the years.  Perhaps the SECTIONS_SHIFT part has remained the same,
 >>> but depending upon its future stability in this function still worries me.
 >>
 >> Yes, I understand the concern.
 >>
 >> The SECTIONS_SHIFT part is the same as the function page_to_section() in
 >> include/linux/mm.h.  I thought that if the values related to SECTIONS_SHIFT
 >> in the kernel change, the crash utility will have to follow them regardless
 >> of this patch, because they are used commonly in crash.  But possible changes
 >> might not be limited to such values.
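 >>
 >> (For reference, this is the kernel helper in include/linux/mm.h that the
 >> calculation mirrors:
 >>
 >>   static inline unsigned long page_to_section(const struct page *page)
 >>   {
 >>           return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
 >>   }
 >>
 >> where SECTIONS_PGSHIFT is the bit offset of the section number within
 >> page->flags and SECTIONS_MASK is ((1UL << SECTIONS_WIDTH) - 1).)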
 > 
 > It's true that crash should follow the upstream values, but in the past
 > there have been periods of time when the MAX_PHYSMEM_BITS and SECTION_SIZE_BITS
 > for different architectures changed upstream, but were not immediately updated
 > in the crash utility.  And that was because crash still worked OK, because most
 > systems did not have large enough memory for the changes to cause things to
 > fail.
 
 Thank you, I understand it more precisely now.
 
 [...]
 > So this goes to the question as to whether is_page_ptr() should return TRUE
 > if the incoming page address is accessible(), but it references a physical
 > address that does not exist.  With the current code, it verifies that it's
 > a page address, and that the page address points to an actual physical
 > memory page.  I suggested using accessible() on the page structure address,
 > but that would not necessarily be accurate, because theoretically a
 > vmemmap'd/instantiated page of page structures could contain page structure
 > addresses that do not reference physical memory.  (e.g., if a hole started
 > at a page address that's in the middle of a vmemmap'd page of page
 > structures)
 
 As to the last example (IIUC), I had confirmed that accessible() returned
 FALSE if a page address was in a hole like the one below on x86_64/RHEL7,
 which is why I wrote the prototype patch that way.
 
   crash> kmem -n
   ...
   NR      SECTION        CODED_MEM_MAP        MEM_MAP       PFN
   ...
   23  ffff88043ffd82e0  ffffea0000000000  ffffea0002e00000  753664   ## There is a
   32  ffff88043ffd8400  ffffea0000000000  ffffea0004000000  1048576  ## hole here.
   ...
   crash> kmem ffffea0002ffffc0  ## The last page struct of NR=23
   DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1
   DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1  ## in is_page_ptr()
   DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1
   DEBUG: addr=0xffffea0002ffffc0 nr=23 accessible=1
         PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
   ffffea0002ffffc0  bffff000                0        0  1 1fffff00000000
   crash> kmem ffffea0003000000  ## A page address in a hole.
   kmem: WARNING: cannot make virtual-to-physical translation: ffffea0003000000
   DEBUG: addr=0xffffea0003000000 nr=24 accessible=0
   DEBUG: addr=0xffffea0003000000 nr=24 accessible=0  ## returned 0
   DEBUG: addr=0xffffea0003000000 nr=24 accessible=0
   ffffea0003000000: kernel virtual address not found in mem map
 
 I'm not sure whether there are cases other than the above where a page
 address does not reference physical memory.  But fundamentally, since
 accessible() is readmem(), which actually reads the dump file, it could
 also quietly return FALSE due to some read error, and that could lead
 to a wrong judgement.
 
 And I had thought that if accessible() were a valid test of page validity,
 there would be no need to calculate the section at all.  So could we remove
 the accessible() call and continue to rely on the valid_section_nr() check
 for the test, like this?:
 ---
 diff --git a/defs.h b/defs.h
 index aa17792..ab98cb7 100644
 --- a/defs.h
 +++ b/defs.h
 @@ -3335,6 +3335,9 @@ struct arm64_stackframe {
  #define VTOP(X)               x86_64_VTOP((ulong)(X))
  #define IS_VMALLOC_ADDR(X)    x86_64_IS_VMALLOC_ADDR((ulong)(X))
  
 +#define IS_VMEMMAP_PAGE_ADDR(X)  x86_64_IS_VMEMMAP_PAGE_ADDR((ulong)(X))
 +#define VMEMMAP_PAGE_TO_PFN(X)   (((ulong)(X) - VMEMMAP_VADDR) / SIZE(page))
 +
  /*
   * the default page table level for x86_64:
   *    4 level page tables
 @@ -5646,6 +5649,7 @@ void x86_64_dump_machdep_table(ulong);
  ulong x86_64_PTOV(ulong);
  ulong x86_64_VTOP(ulong);
  int x86_64_IS_VMALLOC_ADDR(ulong);
 +int x86_64_IS_VMEMMAP_PAGE_ADDR(ulong);
  void x86_64_display_idt_table(void);
  #define display_idt_table() x86_64_display_idt_table()
  long x86_64_exception_frame(ulong, ulong, char *, struct bt_info *, FILE *);
 diff --git a/memory.c b/memory.c
 index 0df8ecc..928c3c2 100644
 --- a/memory.c
 +++ b/memory.c
 @@ -13350,6 +13350,30 @@ is_page_ptr(ulong addr, physaddr_t *phys)
  	physaddr_t section_paddr;
  
  	if (IS_SPARSEMEM()) {
 +#ifdef IS_VMEMMAP_PAGE_ADDR
 +		if (machdep->flags & VMEMMAP) {
 +			if (IS_VMEMMAP_PAGE_ADDR(addr)) {
 +				nr = pfn_to_section_nr(VMEMMAP_PAGE_TO_PFN(addr));
 +				if ((sec_addr = valid_section_nr(nr))) {
 +					coded_mem_map = section_mem_map_addr(sec_addr);
 +					mem_map = sparse_decode_mem_map(coded_mem_map, nr);
 +					end_mem_map = mem_map + (PAGES_PER_SECTION() * SIZE(page));
 +
 +					if ((addr >= mem_map) && (addr < end_mem_map)) {
 +						if ((addr - mem_map) % SIZE(page))
 +							return FALSE;
 +						if (phys) {
 +							section_paddr = PTOB(section_nr_to_pfn(nr));
 +							pgnum = (addr - mem_map) / SIZE(page);
 +							*phys = section_paddr + ((physaddr_t)pgnum * PAGESIZE());
 +						}
 +						return TRUE;
 +					}
 +				}
 +			}
 +			return FALSE;
 +		}
 +#endif
  		nr_mem_sections = NR_MEM_SECTIONS();
  	        for (nr = 0; nr < nr_mem_sections ; nr++) {
  	                if ((sec_addr = valid_section_nr(nr))) {
 diff --git a/x86_64.c b/x86_64.c
 index 7449571..7240c5d 100644
 --- a/x86_64.c
 +++ b/x86_64.c
 @@ -1570,6 +1570,14 @@ x86_64_IS_VMALLOC_ADDR(ulong vaddr)
  		(vaddr >= VSYSCALL_START && vaddr < VSYSCALL_END));
  }
  
 +int
 +x86_64_IS_VMEMMAP_PAGE_ADDR(ulong vaddr)
 +{
 +	return ((machdep->flags & VMEMMAP) &&
 +		(vaddr >= VMEMMAP_VADDR && vaddr <= VMEMMAP_END) &&
 +		!((vaddr - VMEMMAP_VADDR) % SIZE(page)));
 +}
 +
  static int
  x86_64_is_module_addr(ulong vaddr)
  {
 ---
 
 > So if we don't want to change the functionality of is_page_ptr(), then maybe
 > the binary search would be a suitable compromise for both accuracy and speed
 > on extremely large systems?
 
 I have not considered it enough yet, but if all mem_section ranges are
 ascending in section number, and vmemmap holes like the above can be handled
 properly, it might be good.  (I'm guessing that the binary search might need
 an auxiliary array or something because of the vmemmap holes; see the rough
 sketch below.)
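 
 Just to illustrate what I mean (a hypothetical sketch only; the structure
 and function names below do not exist in crash), the auxiliary array could
 hold the valid sections sorted by their mem_map start address:
 
   /* Built once at initialization from valid_section_nr() /
    * section_mem_map_addr() / sparse_decode_mem_map(), sorted by mem_map. */
   struct section_range {
           ulong nr;        /* section number */
           ulong mem_map;   /* first page struct address of the section */
   };
 
   static long
   find_section_by_page_addr(struct section_range *ranges, long count, ulong addr)
   {
           long lo, hi, mid;
           ulong span = PAGES_PER_SECTION() * SIZE(page);
 
           for (lo = 0, hi = count - 1; lo <= hi; ) {
                   mid = (lo + hi) / 2;
                   if (addr < ranges[mid].mem_map)
                           hi = mid - 1;
                   else if (addr >= ranges[mid].mem_map + span)
                           lo = mid + 1;
                   else
                           return mid;  /* addr falls within this section's mem_map */
           }
           return -1;  /* addr lies in a vmemmap hole between valid sections */
   }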
 
 Thanks,
 Kazuhito Hagio 

This "#ifdef IS_VMEMMAP_PAGE_ADDR" patch is getting really ugly.  I've been
playing around with this, and this is my latest counter-proposal.

First, the mem_section numbers are ascending.  They may not necessarily start
with 0, and there may be holes, but they are ascending.  That being the case,
there is no need for is_page_ptr() to walk through NR_MEM_SECTIONS() worth
of entries, because there will be an ending number that's typically much
smaller.  Even on a 256GB dumpfile I have on hand, which has an NR_MEM_SECTIONS()
value of 524288, the largest valid section number is 2055.  (What are the
smallest and largest numbers that you see on whatever large-memory system you
are testing with?)

In any case, let's store the largest section number during initialization in
the vm_table, and use it as a delimiter in is_page_ptr().

diff --git a/defs.h b/defs.h
index aa17792..8768fd5 100644
--- a/defs.h
+++ b/defs.h
@@ -2369,6 +2369,7 @@ struct vm_table {                /* kernel VM-related data */
 		ulong mask;
 		char *name;
 	} *pageflags_data;
+	ulong max_mem_section_nr;
 };
 
 #define NODES                       (0x1)
diff --git a/memory.c b/memory.c
index 0df8ecc..6770937 100644
--- a/memory.c
+++ b/memory.c
@@ -255,7 +255,7 @@ static void PG_reserved_flag_init(void);
 static void PG_slab_flag_init(void);
 static ulong nr_blockdev_pages(void);
 void sparse_mem_init(void);
-void dump_mem_sections(void);
+void dump_mem_sections(int);
 void list_mem_sections(void);
 ulong sparse_decode_mem_map(ulong, ulong);
 char *read_mem_section(ulong);
@@ -13350,7 +13350,7 @@ is_page_ptr(ulong addr, physaddr_t *phys)
 	physaddr_t section_paddr;
 
 	if (IS_SPARSEMEM()) {
-		nr_mem_sections = NR_MEM_SECTIONS();
+		nr_mem_sections = vt->max_mem_section_nr+1;
 	        for (nr = 0; nr < nr_mem_sections ; nr++) {
 	                if ((sec_addr = valid_section_nr(nr))) {
 	                        coded_mem_map = section_mem_map_addr(sec_addr);
@@ -13668,6 +13668,7 @@ dump_vm_table(int verbose)
 	fprintf(fp, "   swap_info_struct: %lx\n", (ulong)vt->swap_info_struct);
 	fprintf(fp, "            mem_sec: %lx\n", (ulong)vt->mem_sec);
 	fprintf(fp, "        mem_section: %lx\n", (ulong)vt->mem_section);
+	fprintf(fp, " max_mem_section_nr: %ld\n", vt->max_mem_section_nr);
 	fprintf(fp, "       ZONE_HIGHMEM: %d\n", vt->ZONE_HIGHMEM);
 	fprintf(fp, "node_online_map_len: %d\n", vt->node_online_map_len);
 	if (vt->node_online_map_len) {
@@ -16295,8 +16296,8 @@ dump_memory_nodes(int initialize)
 		vt->numnodes = n;
 	}
 
-	if (!initialize && IS_SPARSEMEM())
-		dump_mem_sections();
+	if (IS_SPARSEMEM())
+		dump_mem_sections(initialize);
 }
 
 /*
@@ -17128,9 +17129,9 @@ pfn_to_map(ulong pfn)
 }
 
 void 
-dump_mem_sections(void)
+dump_mem_sections(int initialize)
 {
-	ulong nr,addr;
+	ulong nr, max, addr;
 	ulong nr_mem_sections;
 	ulong coded_mem_map, mem_map, pfn;
 	char buf1[BUFSIZE];
@@ -17140,6 +17141,15 @@ dump_mem_sections(void)
 
 	nr_mem_sections = NR_MEM_SECTIONS();
 
+	if (initialize) {
+		for (nr = max = 0; nr < nr_mem_sections ; nr++) {
+			if (valid_section_nr(nr))
+				max = nr;
+		}
+		vt->max_mem_section_nr = max;
+		return;
+	}
+
 	fprintf(fp, "\n");
 	pad_line(fp, BITS32() ? 59 : 67, '-');
         fprintf(fp, "\n\nNR  %s  %s  %s  PFN\n",

Now, with respect to the architecture-specific, VMEMMAP-only part that is
of most interest to you, let's do it with an architecture-specific callback.
You can post it for x86_64, and the other architecture maintainers can write
their own versions.  For example, add a new callback function to the
machdep_table structure, i.e., like this:
  struct machdep_table {
          ulong flags;
          ulong kvbase;
  ...
          void (*get_irq_affinity)(int);
          void (*show_interrupts)(int, ulong *);
+         int (*is_page_ptr)(ulong, physaddr_t *);
  };
  
Write the x86_64_is_page_ptr() function that works for VMEMMAP
kernels, and returns FALSE otherwise.  And add the call to the top 
of is_page_ptr() like this:
  int
  is_page_ptr(ulong addr, physaddr_t *phys)
  {
          int n;
          ulong ppstart, ppend;
          struct node_table *nt;
          ulong pgnum, node_size;
          ulong nr, sec_addr;
          ulong nr_mem_sections;
          ulong coded_mem_map, mem_map, end_mem_map;
          physaddr_t section_paddr;
+	  if (machdep->is_page_ptr(addr, phys))
+		  return TRUE;
  
          if (IS_SPARSEMEM()) {
  ...
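
Just as a rough sketch of what the x86_64 version might look like (untested,
reusing the VMEMMAP range check from your patch, with valid_section_nr()
guarding against holes like the one you showed):

  int
  x86_64_is_page_ptr(ulong addr, physaddr_t *phys)
  {
          ulong pfn, nr;

          /* Only handle page struct addresses in the vmemmap range. */
          if (!(machdep->flags & VMEMMAP) ||
              (addr < VMEMMAP_VADDR) || (addr > VMEMMAP_END) ||
              ((addr - VMEMMAP_VADDR) % SIZE(page)))
                  return FALSE;

          pfn = (addr - VMEMMAP_VADDR) / SIZE(page);
          nr = pfn_to_section_nr(pfn);
          if (!valid_section_nr(nr))
                  return FALSE;

          if (phys)
                  *phys = (physaddr_t)pfn * PAGESIZE();
          return TRUE;
  }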
  
To avoid having to check whether the machdep->is_page_ptr function
exists, write a generic_is_page_ptr() function that just returns
FALSE, and initialize machdep->is_page_ptr to generic_is_page_ptr in
the setup_environment() function.  Later on, individual architectures
can overwrite it when machdep_init(SETUP_ENV) is called.
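
For the fallback, something as trivial as this should do:

  int
  generic_is_page_ptr(ulong addr, physaddr_t *phys)
  {
          return FALSE;  /* no arch-specific fast path; fall through */
  }

and in setup_environment():

  machdep->is_page_ptr = generic_is_page_ptr;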
How's that sound?

Dave