Re: [Crash-utility] help debug number of CPU detect failure

Thursday, 5 March 2020

On Thu, Mar 5, 2020 at 12:54 PM Dave Anderson <anderson(a)redhat.com> wrote:
...

 > > I suspect that it's a problem with either the --kaslr offset and/or
 > > the phys_base value that you have used.
 >
 > Is there a way to find or print the KASLR offset and phys_base on a running Linux system?

 They are normally passed in the VMCOREINFO data that is contained in an ELF PT_NOTE
 in the dumpfile header.  For example, here's a dump of the normal VMCOREINFO data,
 where the phys_base and KASLR offsets are down near the bottom:

                       OSRELEASE=4.18.0-185.el8.x86_64
                       PAGESIZE=4096
                       SYMBOL(init_uts_ns)=ffffffffbd812540
                       SYMBOL(node_online_map)=ffffffffbda0f520
                       SYMBOL(swapper_pg_dir)=ffffffffbd80a000
                       SYMBOL(_stext)=ffffffffbc600000
                       SYMBOL(vmap_area_list)=ffffffffbd8d78b0
                       SYMBOL(mem_section)=ffff956a3ffd2000
                       LENGTH(mem_section)=2048
                       SIZE(mem_section)=16
                       OFFSET(mem_section.section_mem_map)=0
                       SIZE(page)=64
                       SIZE(pglist_data)=171968
                       SIZE(zone)=1472
                       SIZE(free_area)=88
                       SIZE(list_head)=16
                       SIZE(nodemask_t)=128
                       OFFSET(page.flags)=0
                       OFFSET(page._refcount)=52
                       OFFSET(page.mapping)=24
                       OFFSET(page.lru)=8
                       OFFSET(page._mapcount)=48
                       OFFSET(page.private)=40
                       OFFSET(page.compound_dtor)=16
                       OFFSET(page.compound_order)=17
                       OFFSET(page.compound_head)=8
                       OFFSET(pglist_data.node_zones)=0
                       OFFSET(pglist_data.nr_zones)=171232
                       OFFSET(pglist_data.node_start_pfn)=171240
                       OFFSET(pglist_data.node_spanned_pages)=171256
                       OFFSET(pglist_data.node_id)=171264
                       OFFSET(zone.free_area)=192
                       OFFSET(zone.vm_stat)=1296
                       OFFSET(zone.spanned_pages)=112
                       OFFSET(free_area.free_list)=0
                       OFFSET(list_head.next)=0
                       OFFSET(list_head.prev)=8
                       OFFSET(vmap_area.va_start)=0
                       OFFSET(vmap_area.list)=48
                       LENGTH(zone.free_area)=11
                       SYMBOL(log_buf)=ffffffffbd85b140
                       SYMBOL(log_buf_len)=ffffffffbd85b13c
                       SYMBOL(log_first_idx)=ffffffffbe319778
                       SYMBOL(clear_idx)=ffffffffbe319744
                       SYMBOL(log_next_idx)=ffffffffbe319768
                       SIZE(printk_log)=16
                       OFFSET(printk_log.ts_nsec)=0
                       OFFSET(printk_log.len)=8
                       OFFSET(printk_log.text_len)=10
                       OFFSET(printk_log.dict_len)=12
                       LENGTH(free_area.free_list)=5
                       NUMBER(NR_FREE_PAGES)=0
                       NUMBER(PG_lru)=5
                       NUMBER(PG_private)=12
                       NUMBER(PG_swapcache)=9
                       NUMBER(PG_swapbacked)=18
                       NUMBER(PG_slab)=8
                       NUMBER(PG_hwpoison)=22
                       NUMBER(PG_head_mask)=32768
                       NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-129
                       NUMBER(HUGETLB_PAGE_DTOR)=2
                       NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)=-257
    ===============>   NUMBER(phys_base)=16437477376
                       SYMBOL(init_top_pgt)=ffffffffbd80a000
                       NUMBER(pgtable_l5_enabled)=0
                       SYMBOL(node_data)=ffffffffbda0ad20
                       LENGTH(node_data)=1024
    ===============>   KERNELOFFSET=3b600000
                       NUMBER(KERNEL_IMAGE_SIZE)=1073741824
                       NUMBER(sme_mask)=0
                       CRASHTIME=1583350919
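
To make the two marked entries easier to pick out, here is a minimal Python sketch that parses VMCOREINFO-style KEY=VALUE text and returns the two values. The function names are hypothetical and are not part of crash or makedumpfile. It assumes NUMBER(phys_base) is decimal and KERNELOFFSET is hexadecimal, as they appear in the dump above.

```python
def parse_vmcoreinfo(text):
    """Parse VMCOREINFO 'KEY=VALUE' lines into a dict (values stay strings)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value
    return info


def relocation_values(info):
    """Return (phys_base, kaslr_offset) as ints, or None where a key is absent.

    Assumption: NUMBER(phys_base) is decimal and KERNELOFFSET is hex,
    matching the sample dump in this thread.
    """
    phys_base = info.get("NUMBER(phys_base)")
    kaslr = info.get("KERNELOFFSET")
    return (
        int(phys_base, 10) if phys_base is not None else None,
        int(kaslr, 16) if kaslr is not None else None,
    )


sample = """\
NUMBER(phys_base)=16437477376
SYMBOL(init_top_pgt)=ffffffffbd80a000
KERNELOFFSET=3b600000
"""
phys_base, kaslr = relocation_values(parse_vmcoreinfo(sample))
print(hex(phys_base), hex(kaslr))  # 0x3d3c00000 0x3b600000
```

Values are kept as strings until the last step because VMCOREINFO mixes decimal and hex fields, so each key needs its own base.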

 But in your Azure-generated dumpfile, I note that each cpu's NT_PRSTATUS note
 contains junk data, and while it does have a VMCOREINFO note, it contains only this:

 Elf64_Nhdr:
                n_namesz: 11 ("VMCOREINFO")
                n_descsz: 42
                  n_type: 0 (unused)
                          FAKE1=IGNORE1
                          FAKE2=IGNORE2
                          FAKE3=IGNORE3

 So that's why you need to pass in the two arguments.
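
As a rough sanity check on that note, the following Python sketch shows that the sizes in the header are consistent with three newline-terminated placeholder lines and a NUL-terminated name, and that neither value crash needs is present. The newline termination is an assumption inferred from n_descsz, and `missing_keys` is a hypothetical helper, not a crash function.

```python
# Assumption: each placeholder line ends in '\n', which would account
# for the n_descsz of 42 shown in the note header.
FAKE_NOTE = "FAKE1=IGNORE1\nFAKE2=IGNORE2\nFAKE3=IGNORE3\n"
REQUIRED_KEYS = ("NUMBER(phys_base)", "KERNELOFFSET")


def missing_keys(vmcoreinfo_text, required=REQUIRED_KEYS):
    """Return the required keys that do not appear in the VMCOREINFO text."""
    present = {
        line.partition("=")[0].strip()
        for line in vmcoreinfo_text.splitlines()
        if "=" in line
    }
    return [key for key in required if key not in present]


print(len(FAKE_NOTE))             # 42, matches n_descsz
print(len(b"VMCOREINFO\0"))       # 11, matches n_namesz (name plus NUL)
print(missing_keys(FAKE_NOTE))    # ['NUMBER(phys_base)', 'KERNELOFFSET']
```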

 Now, the crash utility should be able to be brought up successfully
 on a live system without passing the arguments.  And once you've done
 that, you could get the values like this:

   crash> help -m | grep phys_base
                   phys_base: 3d3c00000
   crash> help -k | grep relocate
         relocate: ffffffffc4a00000  (KASLR offset: 3b600000 / 950MB)
   crash>

 But since they change with each reboot, you would have to capture them
 while running on the live system, and save them somewhere for a subsequent
 crash.  So that goes back to my question -- how did you get the numbers
 that you used? 
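
The `help -m` and `help -k` values above line up with the VMCOREINFO dump shown earlier, which a few lines of Python arithmetic can confirm. This is a sketch based only on the numbers in this thread: treating `relocate` as the 64-bit two's-complement negation of the KASLR offset is inferred from the output, not taken from the crash sources, and 0xffffffff81000000 as the unrandomized `_stext` is an assumption about the usual x86_64 default.

```python
MASK64 = (1 << 64) - 1


def relocate_from_offset(kaslr_offset):
    """64-bit two's-complement negation of the KASLR offset (inferred relation)."""
    return -kaslr_offset & MASK64


phys_base = 0x3d3c00000        # from 'help -m'
kaslr_offset = 0x3b600000      # from 'help -k' and KERNELOFFSET

print(phys_base)                                # 16437477376, the NUMBER(phys_base) value
print(hex(relocate_from_offset(kaslr_offset)))  # 0xffffffffc4a00000, the 'relocate' value
print(kaslr_offset >> 20)                       # 950, the offset in MB
print(hex(0xffffffffbc600000 - kaslr_offset))   # 0xffffffff81000000, SYMBOL(_stext) minus offset
```
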
I got the numbers by simply grepping through the strings in the coredump:
$ strings vm1_numa_4gb_5cpu.coredump | grep -v strings | \
    grep 'KERNELOFFSET=\|NUMBER(phys_base)='

The machine is still running, and I cross-checked those numbers with crash;
they were correct.
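
For reference, the strings/grep search above can be approximated in Python as follows. This is only a sketch: `scan_dump` is a hypothetical helper, it matches the two keys anywhere in the raw file rather than inside a properly parsed ELF note, and the demo writes a small stand-in file instead of reading the real dump. The demo values are the ones from the running machine.

```python
import mmap
import os
import re
import tempfile

# Keys searched for, mirroring the grep expression used on the dump.
PATTERN = re.compile(rb"(KERNELOFFSET|NUMBER\(phys_base\))=([0-9a-fA-F]+)")


def scan_dump(path):
    """Return distinct (key, value) pairs found in the file, in first-seen order."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        matches = PATTERN.findall(mm)  # list of (key, value) byte tuples
    found = []
    for key, value in matches:
        pair = (key.decode(), value.decode())
        if pair not in found:
            found.append(pair)
    return found


# Stand-in for a dump: binary padding around VMCOREINFO-like text.
blob = (b"\x00\x01\x02" * 100
        + b"NUMBER(phys_base)=4355784704\nKERNELOFFSET=600000\n"
        + b"\xff" * 100)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(blob)
result = scan_dump(tmp.name)
os.unlink(tmp.name)
print(result)  # [('NUMBER(phys_base)', '4355784704'), ('KERNELOFFSET', '600000')]
```

Memory-mapping the file avoids reading a multi-gigabyte dump into memory at once. Like the grep, this can pick up stale copies of the strings elsewhere in memory, so the values still need cross-checking.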

crash> p vmcoreinfo_data+1600
$1 = (unsigned char *) 0xffff917d3cde1640
"poison)=22\nNUMBER(PG_head_mask)=32768\nNUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-128\nNUMBER(HUGETLB_PAGE_DTOR)=2\nNUMBER(phys_base)=4355784704\nSYMBOL(init_top_pgt)=ffffffff82a0a000\nSYMBOL(node_data)=ffffffff82c5d780\nLENGTH(node_data)=1024\nKERNELOFFSET=600000\nNUMBER"...

So it now appears to me that something is wrong in the Azure-generated dump file.

...

 Dave

 --
 Crash-utility mailing list
 Crash-utility(a)redhat.com
 https://www.redhat.com/mailman/listinfo/crash-utility
