Dave,
I agree that we own it from this side to figure out where the rest of
the dump went.
Thank you again for your help,
Frank
________________________________
From: crash-utility-bounces(a)redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Monday, May 14, 2007 3:16 PM
To: Discussion list for crash utility usage, maintenance and development
Subject: Re: [Crash-utility] Seek error type: "tss_struct ist array" problem on 8-CPU AMD system
"Jansen, Frank" wrote:
-----Original Message-----
From: crash-utility-bounces(a)redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Monday, May 14, 2007 12:22 PM
To: Discussion list for crash utility usage, maintenance and development
Subject: Re: [Crash-utility] Seek error type: "tss_struct ist array" problem on 8-CPU AMD system
"Jansen, Frank" wrote:
> Looking through the changelog, I saw that the 'tss_struct ist array'
> problem on 8-CPU systems had been addressed previously. However, I'm
> running into this issue on an AMD server with crash 4.0-4.1 and RHEL4
> Update 5 (2.6.9-55.ELsmp).
>
> The output from the crash invocation is the following:
> +++
> [root@well-rhel4564-ps3 dump]# /fpj/crash System_map.2.6.9-55.ELsmp
> vmlinux.debug.2.6.9-55.ELsmp ap3.1178895173.dmp
>
> crash 4.0-4.1
> Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007 Red Hat, Inc.
> Copyright (C) 2004, 2005, 2006 IBM Corporation
> Copyright (C) 1999-2006 Hewlett-Packard Co
> Copyright (C) 2005, 2006 Fujitsu Limited
> Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> Copyright (C) 2005 NEC Corporation
> Copyright (C) 1999, 2002 Silicon Graphics, Inc.
> Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> This program is free software, covered by the GNU General Public
> License, and you are welcome to change it and/or distribute copies of
> it under certain conditions. Enter "help copying" to see the
> conditions.
> This program has absolutely no warranty. Enter "help warranty" for
> details.
>
> GNU gdb 6.1
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and
> you are welcome to change it and/or distribute copies of it under
> certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for
> details.
> This GDB was configured as "x86_64-unknown-linux-gnu"...
>
> crash: seek error: kernel virtual address: 10408119e84 type:
> "tss_struct ist array"
> ---
>
> The server is a 4 dual-core AMD (2.8GHz) with 64GB.
>
> Any insights into how best to troubleshoot this are much appreciated.
>
> Thanks,
>
> Frank Jansen
I doubt this has anything to do with the 8-cpu issue.
I think that you are right, as the crash -d7 output seems to indicate
that the dump may be incomplete (cf. attached crash -d7 output).
A few questions:
Is this an RHEL4 derivative kernel of some kind? I ask because you're
using a system.map file as an argument.
It's a standard kernel, to which we add a couple of our (Egenera)
drivers. I can read the dump without the system map argument, but was
just going off the data provided to me by the person that ran into the
problem.
Anyway, this dumpfile is Egenera's LKCD off-shoot, correct?
Since you got an "lseek" error, the question is whether (1) the virtual
address of 10408119e84 is legitimate, and (2) whether it is included in
your dumpfile.
I think that the virtual address is legitimate, but that the dump is
incomplete at this point.
> What does "crash -d7 ..." show?
See attached output
> Does crash work on the live system?
Yes, it works
Right -- if it works on the live system, there's a good chance that
the page is missing from the dumpfile. The tss_struct for each cpu is
located in each cpu's per-cpu data area. I have seen the exact same
problem with x86_64 netdump "vmcore-incomplete" dumpfiles, where the
per-cpu data areas, allocated with alloc_bootmem_node(), would tend to
be located in very high physical memory (beyond the end of the
vmcore-incomplete contents).
On a 64GB system, the virtual address of 10408119e84 (~16GB physical)
would certainly not be out of the question. And if it can be read on
the live machine (crash -d7 will show the same address access
sequence), then it's probably not included in the dumpfile for
whatever reason.
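The ~16GB figure can be checked by hand. Below is a minimal sketch, assuming the address lies in the kernel's direct-mapped region and that the base of that region (PAGE_OFFSET) on this 2.6.9 x86_64 kernel is 0x10000000000. Both are assumptions about this particular kernel, and the names ASSUMED_PAGE_OFFSET, direct_map_to_phys() and show_phys() are made up for illustration; they are not crash functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed base of the kernel direct mapping on a 2.6.9 x86_64 kernel. */
#define ASSUMED_PAGE_OFFSET 0x10000000000ULL

/* Hypothetical helper: direct-mapped kernel virtual -> physical address. */
uint64_t direct_map_to_phys(uint64_t kvaddr)
{
        return kvaddr - ASSUMED_PAGE_OFFSET;
}

/* Print the physical address and its position in memory, in GB (2^30 bytes). */
void show_phys(uint64_t kvaddr)
{
        uint64_t paddr = direct_map_to_phys(kvaddr);

        printf("%llx -> physical %llx (%.2f GB)\n",
               (unsigned long long)kvaddr,
               (unsigned long long)paddr,
               (double)paddr / (1024.0 * 1024.0 * 1024.0));
}
```

Under those assumptions, show_phys(0x10408119e84ULL) reports physical address 408119e84, a little over 16GB, which is well inside a 64GB machine.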
In fact, looking at the -d7 output, the level4_pgt pagetable pointers
for each non-cpu0 cpu_pda get allocated with __get_free_pages() -- and
there are a couple from the 10408xxxxxx virtual memory location:
...
<readmem: ffffffff804ed700, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU0: level4_pgt: ffffffff80101000 data_offset: 10087adef60
<readmem: ffffffff804ed780, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU1: level4_pgt: 1040802c000 data_offset: 10487bf8d60
<readmem: ffffffff804ed800, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU2: level4_pgt: 10408008000 data_offset: 10887bf8d60
<readmem: ffffffff804ed880, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU3: level4_pgt: 10bf9ff2000 data_offset: 10c87bfbf60
<readmem: ffffffff804ed900, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU4: level4_pgt: 10008028000 data_offset: 10087ae6f60
<readmem: ffffffff804ed980, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU5: level4_pgt: 10bf9f8a000 data_offset: 10487c00d60
<readmem: ffffffff804eda00, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU6: level4_pgt: 100f7f08000 data_offset: 10887c00d60
<readmem: ffffffff804eda80, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU7: level4_pgt: 107f9f8e000 data_offset: 10c87c03f60
<readmem: 10008000084, KVADDR, "tss_struct ist array", 56, (FOE), 90c5b0>
<readmem: 10408119e84, KVADDR, "tss_struct ist array", 56, (FOE), 90c5e8>
crash: seek error: kernel virtual address: 10408119e84 type: "tss_struct ist array"
They weren't *read* from there at that point, but it shows that
there was memory in that neighborhood. Anyway, the "seek error"
from LKCD means that the physical page couldn't be found in the
dumpfile by lkcd_lseek():
/*
 *  Read from an LKCD formatted dumpfile.
 */
int
read_lkcd_dumpfile(int fd, void *bufptr, int cnt, ulong addr, physaddr_t paddr)
{
        set_lkcd_fp(fp);

        if (!lkcd_lseek(paddr))
                return SEEK_ERROR;

        if (lkcd_read((void *)bufptr, cnt) != cnt)
                return READ_ERROR;

        return cnt;
}
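To illustrate why a page that never made it into the dump shows up as a seek error rather than a read error, here is a minimal sketch of a page-indexed lookup. This is not the actual lkcd_lseek() from crash or LKCD; the page_index struct, find_page_offset() and SKETCH_SEEK_ERROR are hypothetical, and the sketch only assumes that the dumpfile records which physical pages it holds and where.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12            /* 4KB pages assumed */
#define SKETCH_SEEK_ERROR (-1L)         /* hypothetical return code */

/* Hypothetical index entry: one physical page saved in the dump. */
struct page_index {
        uint64_t pfn;           /* physical page frame number */
        long file_offset;       /* where that page's data starts in the file */
};

/*
 * Look up the file offset that holds physical address paddr.
 * Returns SKETCH_SEEK_ERROR if the page was never written to the dump,
 * which is what happens for pages past the end of a truncated dumpfile.
 */
long find_page_offset(const struct page_index *idx, size_t n, uint64_t paddr)
{
        uint64_t pfn = paddr >> SKETCH_PAGE_SHIFT;
        uint64_t in_page = paddr & ((1ULL << SKETCH_PAGE_SHIFT) - 1);
        size_t i;

        /* Linear scan keeps the sketch short; a real index would be faster. */
        for (i = 0; i < n; i++) {
                if (idx[i].pfn == pfn)
                        return idx[i].file_offset + (long)in_page;
        }
        return SKETCH_SEEK_ERROR;
}
```

With an index that only covers low memory, a lookup for a low physical address succeeds while one for a page around 16GB fails before any read is attempted, which is the same pattern as the two "tss_struct ist array" reads in the -d7 output.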
I can't really help you from that point on, though...
Dave