sharyathi nagesh wrote:
Hi
I am seeing a problem with the crash tool on a system with NUMA nodes.
crash exits with an error message, and no further analysis of the dump is possible.
=====
Error message:
cassinilp1:~ # crash
crash 4.0-3.14
Copyright (C) 2002, 2003, 2004, 2005, 2006 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005 Fujitsu Limited
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "powerpc64-unknown-linux-gnu"...
crash: numnodes out of sync with pgdat_list?
=====
The system configuration is:
Node 0 Memory:
Node 1 Memory:
Node 2 Memory:
Node 3 Memory:
Node 4 Memory: 0x0-0x180000000
Node 0 CPUs: 0
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs: 1
=====
The error is raised by this check in the dump_memory_nodes() function in memory.c:
    if (n != vt->numnodes)
        error(FATAL, "numnodes out of sync with pgdat_list?\n");
The cause is a mismatch between node_online_map and the number of
nodes found by traversing pgdat_list.
The bits of node_online_map are set differently in kernel versions 2.6.16 and 2.6.19.
In the earlier version, every bit from the first bit up to the
nth bit is set to '1', where n is the last node to which memory is assigned.
In the later version, a node is considered online if either memory or a CPU
(or both) is allocated to it.
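To make the difference concrete, here is a minimal sketch of the two conventions as described above. It is not kernel or crash code: struct node_info and all function names are hypothetical, and the rules are only my reading of the description in this message.

```c
/* Hypothetical per-node summary; not a kernel or crash structure. */
struct node_info {
    int has_memory;
    int has_cpu;
};

/* Earlier convention as described: bits 0..n are set, where n is the
 * last node that has memory assigned. */
static unsigned long online_map_earlier(const struct node_info *nodes, int count)
{
    unsigned long map = 0;
    int last = -1;
    int i;

    for (i = 0; i < count; i++)
        if (nodes[i].has_memory)
            last = i;
    for (i = 0; i <= last; i++)
        map |= 1UL << i;
    return map;
}

/* Later convention as described: a node's bit is set if it has
 * memory, a CPU, or both. */
static unsigned long online_map_later(const struct node_info *nodes, int count)
{
    unsigned long map = 0;
    int i;

    for (i = 0; i < count; i++)
        if (nodes[i].has_memory || nodes[i].has_cpu)
            map |= 1UL << i;
    return map;
}

/* Count the bits set in a map. */
static int count_set_bits(unsigned long map)
{
    int n = 0;

    while (map) {
        n += (int)(map & 1UL);
        map >>= 1;
    }
    return n;
}
```

For the configuration listed above (memory only on node 4, CPUs on nodes 0 and 4), this sketch sets 5 bits under the earlier rule and 2 bits under the later rule, so a node count taken from the map depends on which rule the kernel follows.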
So I would like your suggestions on how to fix the problem.
A few ideas I had:
1) If KERNEL_VERSION <= 2.6.16, increment vt->numnodes only if the bits of
node_online_map and cpu_online_map are set.
If KERNEL_VERSION > 2.6.16, use only node_online_map.
(This will partly solve the problem.)
2) Or, as in node_table_init(), raise the error only when CRASHDEBUG(2) is set; otherwise update
vt->numnodes with 'n'.
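One possible reading of idea (2) can be sketched as follows. This is only an illustration, not a patch against crash: struct vm_table_sketch, debug_level_sketch and sync_numnodes() are hypothetical stand-ins for vt, the CRASHDEBUG() level and the check in dump_memory_nodes(), and the sketch assumes numnodes is updated whether or not the note is printed.

```c
#include <stdio.h>

/* Stand-in for the part of crash's vm table that holds numnodes
 * (hypothetical; the real structure has many more fields). */
struct vm_table_sketch {
    int numnodes;
};

/* Stand-in for the debug level tested by CRASHDEBUG() (assumption). */
static int debug_level_sketch = 0;

/* Instead of ending the session on a mismatch, print a note when the
 * debug level is 2 or higher and adopt the traversed count 'n'.
 * Returns 1 if numnodes was changed, 0 if it was already in sync. */
static int sync_numnodes(struct vm_table_sketch *vt, int n)
{
    if (n == vt->numnodes)
        return 0;

    if (debug_level_sketch >= 2)
        fprintf(stderr, "NOTE: changing numnodes from %d to %d\n",
                vt->numnodes, n);

    vt->numnodes = n;
    return 1;
}
```

With this shape the mismatch no longer ends the session; the message only appears at higher debug levels, and later code sees the count obtained from the traversal.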
Please let me know your opinion.
Regards
Sharyathi Nagesh
Hi Sharyathi,
Thanks a lot for debugging this.
I prefer your idea (2), which, if it works OK in your case, will not break
any other currently-working incarnations.
Also, just to clarify, when you say "Raise the error...", node_table_init()
only makes an "error(NOTE, ...)" call, so you would simply get a "NOTE: ..."
message displayed if CRASHDEBUG(2), and the crash session would
still continue. That's also what we would want in this case, unlike the
session-ending "error(FATAL, ...)" that you're seeing now...
Thanks,
Dave