"Jansen, Frank" wrote:
> -----Original Message-----
> From: crash-utility-bounces@redhat.com
> [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
> Sent: Monday, May 14, 2007 12:22 PM
> To: Discussion list for crash utility usage, maintenance and
> development
> Subject: Re: [Crash-utility] Seek error type: "tss_struct ist
> array" problem on8-CPU AMD system
>
> "Jansen, Frank" wrote:
>
> > Looking through the changelog, I saw that the 'tss_struct ist array'
> > problem on 8-CPU systems had been addressed previously.
> However, I'm
> > running into this issue on an AMD server with crash 4.0-4.1
> and RHEL4
> > Update 5 (2.6.9-55.Elsmp).
> >
> > The output from the crash invocation is the following:
> > +++
> > [root@well-rhel4564-ps3 dump]# /fpj/crash System_map.2.6.9-55.ELsmp
> > vmlinux.debug.2.6.9-55.ELsmp ap3.1178895173.dmp
> >
> > crash 4.0-4.1
> > Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007  Red Hat, Inc.
> > Copyright (C) 2004, 2005, 2006  IBM Corporation Copyright (C)
> > 1999-2006  Hewlett-Packard Co Copyright (C) 2005, 2006  Fujitsu
> > Limited Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
> > Copyright (C) 2005  NEC Corporation
> > Copyright (C) 1999, 2002  Silicon Graphics, Inc.
> > Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
> > This program is free software, covered by the GNU General Public
> > License, and you are welcome to change it and/or distribute
> copies of
> > it under certain conditions.  Enter "help copying" to see the
> > conditions.
> > This program has absolutely no warranty.  Enter "help warranty" for
> > details.
> >
> > GNU gdb 6.1
> > Copyright 2004 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public
> License, and
> > you are welcome to change it and/or distribute copies of it under
> > certain conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB.  Type "show warranty" for
> > details.
> > This GDB was configured as "x86_64-unknown-linux-gnu"...
> >
> > crash: seek error: kernel virtual address: 10408119e84  type:
> > "tss_struct ist array"
> > ---
> >
> > The server is a 4 dual-core AMD (2.8GHz) with 64GB.
> >
> > Any insights into how best to troubleshoot this are much
> appreciated.
> >
> > Thanks,
> >
> > Frank Jansen
>
> I doubt this has anything to do with the 8-cpu issue.
>
I think that you are right, as the crash -d7 output seems to indicate that
the dump may be incomplete (cf. attached crash -d7 output).

> A few questions:
>
> Is this an RHEL4 derivative kernel of some kind?  I ask
> because you're using a system.map file as an argument.
>
It's a standard kernel, to which we add a couple of our (Egenera)
drivers.  I can read the dump without the system map argument, but was
just going off the data provided to me by the person that ran into the
problem.

> Anyway, this dumpfile is Egenera's LKCD off-shoot, correct?
> Since you got an "lseek" error, the question is whether (1)
> the virtual address of 10408119e84 is legitimate, and (2)
> whether it is included in your dumpfile.

I think that the virtual address is legitimate, but that the dump is
incomplete at this point.

>
> What does "crash -d7 ..." show?

See attached output

>
> Does crash work on the live system?
Yes, it works

Right -- if it works on the live system, there's a good chance that
the page is simply missing from the dumpfile.  The tss_struct for each
cpu is located in each cpu's per-cpu data area.  I have seen the
exact same problem with x86_64 netdump "vmcore-incomplete" dumpfiles,
where the per-cpu data areas, allocated with alloc_bootmem_node(),
would tend to be located in very high physical memory (beyond the
end of the vmcore-incomplete contents).

On a 64GB system, the virtual address of 10408119e84 (~16GB physical)
would certainly not be out of the question.  And if it can be read
on the live machine (crash -d7 will show the same address access
sequence), then it's probably not included in the dumpfile for
whatever reason.
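
The "~16GB physical" figure can be checked by hand.  What follows is a
minimal sketch, assuming this 2.6.9 x86_64 kernel maps physical memory
linearly starting at virtual address 0x10000000000 (the 100xxxxxxxx to
10cxxxxxxxx addresses in the -d7 output are consistent with that).  The
constant and function names are illustrative only, not crash's own:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base of the kernel's direct mapping of physical memory. */
#define ASSUMED_PAGE_OFFSET 0x10000000000ULL

/* Translate a direct-mapped kernel virtual address to a physical address. */
uint64_t kvaddr_to_paddr(uint64_t kvaddr)
{
        assert(kvaddr >= ASSUMED_PAGE_OFFSET);
        return kvaddr - ASSUMED_PAGE_OFFSET;
}

/* Number of whole gigabytes that lie below the given physical address. */
uint64_t paddr_to_gb(uint64_t paddr)
{
        return paddr >> 30;
}
```

Under that assumption, kvaddr_to_paddr(0x10408119e84) is 0x408119e84,
which is just past the 16GB mark, while the address that was read
successfully (10008000084) works out to 0x8000084, around 128MB.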

In fact, looking at the -d7 output, the level4_pgt pagetable pointers
for each non-cpu0 cpu_pda get allocated with __get_free_pages() -- and
there are a couple in the 10408xxxxxx virtual address range:

...
<readmem: ffffffff804ed700, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU0: level4_pgt: ffffffff80101000 data_offset: 10087adef60
<readmem: ffffffff804ed780, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU1: level4_pgt: 1040802c000 data_offset: 10487bf8d60
<readmem: ffffffff804ed800, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU2: level4_pgt: 10408008000 data_offset: 10887bf8d60
<readmem: ffffffff804ed880, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU3: level4_pgt: 10bf9ff2000 data_offset: 10c87bfbf60
<readmem: ffffffff804ed900, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU4: level4_pgt: 10008028000 data_offset: 10087ae6f60
<readmem: ffffffff804ed980, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU5: level4_pgt: 10bf9f8a000 data_offset: 10487c00d60
<readmem: ffffffff804eda00, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU6: level4_pgt: 100f7f08000 data_offset: 10887c00d60
<readmem: ffffffff804eda80, KVADDR, "cpu_pda entry", 128, (FOE), 930580>
CPU7: level4_pgt: 107f9f8e000 data_offset: 10c87c03f60
<readmem: 10008000084, KVADDR, "tss_struct ist array", 56, (FOE), 90c5b0>
<readmem: 10408119e84, KVADDR, "tss_struct ist array", 56, (FOE), 90c5e8>
crash: seek error: kernel virtual address: 10408119e84  type: "tss_struct ist array"

They weren't *read* from there at that point, but it shows that
there was memory in that neighborhood.  Anyway, the "seek error"
from LKCD means that the physical page couldn't be found in the
dumpfile by lkcd_lseek():

/*
 *  Read from an LKCD formatted dumpfile.
 */
int
read_lkcd_dumpfile(int fd, void *bufptr, int cnt, ulong addr, physaddr_t paddr)
{
        set_lkcd_fp(fp);

        if (!lkcd_lseek(paddr))
                return SEEK_ERROR;

        if (lkcd_read((void *)bufptr, cnt) != cnt)
                return READ_ERROR;

        return cnt;
}
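
To illustrate why a missing page shows up as a seek error rather than a
read error, here is a toy model of that kind of lookup.  It is only a
sketch: the index structure, the names and the 4KB page size are all
assumptions made for the example, and it does not reflect how
lkcd_lseek() or the LKCD file format actually work:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)
#define TOY_SEEK_ERROR (-1L)

/* Hypothetical index entry: one per physical page saved in the dumpfile. */
struct toy_page_entry {
        uint64_t pfn;          /* physical page frame number */
        long     file_offset;  /* where that page's data starts in the file */
};

/*
 * Return the file offset holding the byte at paddr, or TOY_SEEK_ERROR
 * if the page containing paddr was never written to the dumpfile.
 */
long toy_find_page(const struct toy_page_entry *idx, size_t n, uint64_t paddr)
{
        uint64_t pfn = paddr >> TOY_PAGE_SHIFT;
        size_t i;

        for (i = 0; i < n; i++) {
                if (idx[i].pfn == pfn)
                        return idx[i].file_offset +
                               (long)(paddr & (TOY_PAGE_SIZE - 1));
        }
        return TOY_SEEK_ERROR;  /* page is not in the dumpfile at all */
}
```

In this model, a dumpfile that was cut short has no index entry for
the high pages, so the lookup fails before any data is read.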

I can't really help you from that point on, though...

Dave