CA SEOS module causes heartburn for crash(1)
by Kurtis D. Rader
I was asked to look at an s390 SUSE dump which had the Computer Associates
SEOS product modules loaded (seos and ksymadd). They exported symbols
with addresses well below the address space for modules. For example,
"dynamic_Seos_syscall_num" has an address of 0x4f90dc0, which is well
below the base of the first module at 0x7880d000. This results in crash
trying to mmap an anonymous 1.9 GB region, which naturally fails on an
s390 system where the address space is only 2 GiB in size.
Anyone else run across this? Should crash be able to deal with this or
should I simply tell the customer to stop using CA's SEOS product if they
want us to look at their crash dumps?
--
Kurtis D. Rader, Level 3 Linux Support
ABC Service Center, Linux Change Team
T/L 775-3714, DID +1 503-578-3714
Re: [Crash-utility] crash 4.0-2.8 fails on 2.6.14-rc5 (EM64T)
by Dave Anderson
> Is there no simple way to add #if KERNEL_VERSION > 2.6.10
> in the header file and leave the hardcoded values there?
>
THIS_KERNEL_VERSION is based upon crash internal data variables in the
kernel_table data structure that get initialized in kernel_init(PRE_GDB)
based upon the contents of the kernel's "system_utsname" data structure
read from memory or the dumpfile.
I was mistaken in using the value of "_stext" as the qualifier, though,
since the __START_KERNEL_map value of 0xffffffff80000000 is still the same.
But there must be *some* difference in the symbol list that can be used
to determine which set of address values to use. It could even be just
the *existence* of some new kernel variable introduced as part of the
change to the new scheme. Doing an "nm -Bn" on the old and new
vmlinux files should yield something obvious.
> bt -t seems to do better.
>
> crash> bt 3144
> PID: 3144 TASK: ffff81011dd1e100 CPU: 0 COMMAND: "mingetty"
> #0 [ffff81011d6b9c68] schedule at ffffffff803b12b3
> RIP: 000000377c7b85b2 RSP: 00007fffff87a110 RFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffffff8010dc26 RCX: 00007fffff87a7b0
> RDX: 0000000000000001 RSI: 00007fffff87a8c7 RDI: 0000000000000000
> RBP: 00007fffff87aca0 R8: 00002aaaaaac9b00 R9: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffff87a900
> R13: 0000000000502d20 R14: 0000000000000000 R15: 000000007c92d8c0
> ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
> crash> bt -t 3144
> PID: 3144 TASK: ffff81011dd1e100 CPU: 0 COMMAND: "mingetty"
> START: thread_return (schedule) at ffffffff803b12b3
> [ffff81011d6b9d10] do_con_write at ffffffff802689da
> [ffff81011d6b9d80] schedule_timeout at ffffffff803b1e4e
> [ffff81011d6b9db0] _spin_lock_irqsave at ffffffff803b28ce
> [ffff81011d6b9dc0] add_wait_queue at ffffffff8014cf5c
> [ffff81011d6b9de0] read_chan at ffffffff8025d1f7
> [ffff81011d6b9e48] default_wake_function at ffffffff80130c90
> [ffff81011d6b9e78] default_wake_function at ffffffff80130c90
> [ffff81011d6b9e90] tty_ldisc_deref at ffffffff802571c4
> [ffff81011d6b9ed0] tty_read at ffffffff802575ee
> [ffff81011d6b9f10] vfs_read at ffffffff80183a46
> [ffff81011d6b9f40] sys_read at ffffffff80183e03
> [ffff81011d6b9f80] system_call at ffffffff8010dc26
> RIP: 000000377c7b85b2 RSP: 00007fffff87a110 RFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffffff8010dc26 RCX: 00007fffff87a7b0
> RDX: 0000000000000001 RSI: 00007fffff87a8c7 RDI: 0000000000000000
> RBP: 00007fffff87aca0 R8: 00002aaaaaac9b00 R9: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffff87a900
> R13: 0000000000502d20 R14: 0000000000000000 R15: 000000007c92d8c0
> ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
> crash>
>
>
I still don't understand what happens in x86_64_low_budget_back_trace_cmd()
that causes the "bt" command to skip from the starting point in schedule()
to the end, where it dumps the user-mode entry exception frame, unless
the rsp has been bumped too high by the time it gets to this point:
/*
* Walk the process stack.
*/
for (i = (rsp - bt->stackbase)/sizeof(ulong);
!done && (rsp < bt->stacktop); i++, rsp += sizeof(ulong)) {
...and that conceivably may have something to do with the exception stack
problem. It's hard to say without being there...
Thanks,
Dave
Re: Re: [Crash-utility] crash 4.0-2.8 fails on 2.6.14-rc5 (EM64T)
by Dave Anderson
This message was bounced due to the size of its attachment;
I've since raised the maximum allowable message size:
Re: [Crash-utility] crash 4.0-2.8 fails on 2.6.14-rc5 (EM64T)
Date: Wed, 26 Oct 2005 09:15:47 -0700
From: Badari Pulavarty <pbadari(a)us.ibm.com>
To: <crash-utility(a)redhat.com>
References: 1, 2, 3, 4
On Wed, 2005-10-26 at 11:51 -0400, Dave Anderson wrote:
> >
> > crash: read error: kernel virtual address: ffff8100050eb084 type:
> > "tss_struct ist array"
> >
>
> I see that the 2.6.13 kernel defines its init_tss
> array like so:
>
> DEFINE_PER_CPU(struct tss_struct, init_tss) ____cacheline_maxaligned_in_smp;
>
> whereas, the earlier 2.6 kernels do it like this:
>
> DECLARE_PER_CPU(struct tss_struct,init_tss);
>
> If this change modifies the way that per-cpu variable addresses
> are laid out, then I can't tell you what to do without significant
> further investigation. But until proven otherwise, let's presume
> that the per-cpu data calculations are done the same way.
>
> There are two places where that error message comes from, both
> in x86_64_ist_init(), but given that the above per-cpu declarations
> are functionally equivalent, the following kernel symbol should
> still be in your vmlinux, verifiable like so:
>
> $ nm -Bn vmlinux | grep per_cpu__init_tss
> ffffffff80502100 D per_cpu__init_tss
> $
>
> If it's not there, crash is hosed, and significant work needs
> to be done to find it. But if the symbol is still intact in
> the 2.6.14 kernel, the failure should have come from an incorrect
> calculation of the vaddr of the init_tss below:
None of the above stuff changed, so we are fine.
> static void
> x86_64_ist_init(void)
> {
> ...
>
>         } else if (symbol_exists("per_cpu__init_tss")) {
>                 for (c = 0; c < NR_CPUS; c++) {
>                         if ((kt->flags & SMP) &&
>                             (kt->flags & PER_CPU_OFF)) {
>                                 if (kt->__per_cpu_offset[c] == 0)
>                                         break;
>                                 vaddr = symbol_value("per_cpu__init_tss") +
>                                         kt->__per_cpu_offset[c];
>                         } else
>                                 vaddr = symbol_value("per_cpu__init_tss");
>
>                         vaddr += OFFSET(tss_struct_ist);
>
>                         readmem(vaddr, KVADDR, &ms->stkinfo.ebase[c][0],
>                                 sizeof(ulong) * 7, "tss_struct ist array",
>                                 FAULT_ON_ERROR);
>
Yes. I realized that the problem is due to a messed-up
kt->__per_cpu_offset[c] value. These should be offsets into the array,
so they should be small values, but I see huge numbers:
per-cpu offset: 84afdf60
I also realized that this gets set at the lines I touched earlier :(
I can't figure out what I broke; we are just reading a value
from the kernel structure and setting it.
>                         if (ms->stkinfo.ebase[c][0] == 0)
>                                 break;
>                 }
>         }
>
> I'm also presuming your test kernel is SMP, but I'm wondering whether
> the SMP and PER_CPU_OFF flags are set.
Yes.
> The SMP flag should have been pre-set in kernel_init(), but the
> PER_CPU_OFF flag gets set in x86_64_cpu_pda_init(), which you
> have modified.
>
> You can display the kt->flags contents with a printf in x86_64_ist_init().
> If PER_CPU_OFF is not set, then that's probably the issue here.
>
> Can you show your new versions of x86_64_cpu_pda_init() and
> x86_64_get_smp_cpus()?
Here are the new versions of the x86_64 code for your review.
Thanks,
Badari
Crash white paper.
by Troy Heber
Hi Dave,
I would like to include the Crash white paper as part of the
documentation we ship to customers. However, no license is listed,
which means we have no right to distribute it. So I'm hoping for it to be
released under the GPL or, if that's not possible, for explicit permission to
redistribute an unmodified copy of the white paper found here:
http://people.redhat.com/anderson/.crash_whitepaper
Troy
crash 4.0-2.8 fails on 2.6.14-rc5 (EM64T)
by Badari Pulavarty
Hi,
I am getting following failures from "crash" when tried
running on 2.6.14-rc5 on EM64T machine. Is this a known
problem ?
Thanks,
Badari
[root@localhost crash-4.0-2.8]# crash --readnow
crash 4.0-2.8
Copyright (C) 2002, 2003, 2004, 2005 Red Hat, Inc.
Copyright (C) 2004, 2005 IBM Corporation
Copyright (C) 1999-2005 Hewlett-Packard Co
Copyright (C) 1999, 2002 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
crash: invalid structure member offset: x8664_pda_level4_pgt
FILE: x86_64.c LINE: 332 FUNCTION: x86_64_cpu_pda_init()
[/usr/bin/crash] error trace: 4456f3 => 4a6c8e => 4a7c0c => 4d0c1e
4d0c1e: OFFSET_verify+117
4a7c0c: x86_64_cpu_pda_init+771
4a6c8e: x86_64_init+1522
4456f3: main_loop+50
crash version 4.0-2.8 is available
by Dave Anderson
Thanks to Jun'ichi Nomura of NEC for addressing an annoying issue
with the "mod" command when a registered kernel module name string
does not directly relate to its module object file. For example,
the "dm_mod" module comes from the "dm-mod.ko" object file, so the
"mod -S" or "mod -s dm_mod" commands would fail to load the debug
data for that module:
crash> mod -s dm_mod
mod: cannot find or load object file for dm_mod module
crash>
Since this inconsistency always seems to be due to underscores
in the kernel name string and dashes in the object
filename, Jun'ichi's patch retries all module debug data load
failures after replacing the underscores with dashes, and voila,
it finds the object file:
crash> mod -s dm_mod
MODULE NAME SIZE OBJECT FILE
ffffffffa01a5380 dm_mod 66433 lib/modules/2.6.9-22.2.ELsmp/kernel/drivers/md/dm-mod.ko
crash>
And, as expected, "mod -S" now finds and loads them all.
That is the only change in 4.0-2.8.
Thanks,
Dave
crash and the "hugemem" kernel
by Kurtis D. Rader
Should crash be able to read a RHEL 3 hugemem dump? I've got an x86 14 GiB
netdump vmcore taken under controlled conditions (i.e., the system was booted
and a crash dump manually invoked) that the crash(1) command doesn't like:
vmcore: initialization failed
netdump_data:
flags: 5 (NETDUMP_LOCAL|NETDUMP_ELF32)
ndfd: 3
ofp: 4b2b95e0
header_size: 4096
num_pt_load_segments: 1
pt_load_segment[0]:
file_offset: 1000
phys_start: 0
phys_end: 90000000
netdump_header: 838a700
elf32: 838a700
notes32: 838a734
load32: 838a754
elf64: 0
notes64: 0
load64: 0
nt_prstatus: 838a774
nt_prpsinfo: 838a814
nt_taskstruct: 838a8a0
task_struct: 0
switch_stack: 0
Elf32_Ehdr:
e_ident: \177ELF
e_ident[EI_CLASS]: 1 (ELFCLASS32)
e_ident[EI_DATA]: 1 (ELFDATA2LSB)
e_ident[EI_VERSION]: 1 (EV_CURRENT)
e_ident[EI_OSABI]: 0 (ELFOSABI_SYSV)
e_ident[EI_ABIVERSION]: 0
e_type: 4 (ET_CORE)
e_machine: 3 (EM_386)
e_version: 1 (EV_CURRENT)
e_entry: 0
e_phoff: 34
e_shoff: 0
e_flags: 0
e_ehsize: 34
e_phentsize: 20
e_phnum: 2
e_shentsize: 0
e_shnum: 0
e_shstrndx: 0
Elf32_Phdr:
p_type: 4 (PT_NOTE)
p_offset: 116 (74)
p_vaddr: 0
p_paddr: 0
p_filesz: 396 (18c)
p_memsz: 0 (0)
p_flags: 0 ()
p_align: 0
Elf32_Phdr:
p_type: 1 (PT_LOAD)
p_offset: 4096 (1000)
p_vaddr: c0000000
p_paddr: 0
p_filesz: 2415919104 (90000000)
p_memsz: 2415919104 (90000000)
p_flags: 7 (PF_X|PF_W|PF_R)
p_align: 4096
Elf32_Nhdr:
n_namesz: 4 ("CORE")
n_descsz: 144
n_type: 1 (NT_PRSTATUS)
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 023b0dc0 00000001
00000001 00000063 00000006 00000063
00000000 00000068 00000068 66d80000
73d80033 ffffffff 021d0950 00000060
00010286 96c15f3c 00000068 00000000
Elf32_Nhdr:
n_namesz: 4 ("CORE")
n_descsz: 124
n_type: 3 (NT_PRPSINFO)
00005200 00000000 00000000 00000000
00000000 00000000 00000000 696c6d76
0078756e 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000
Elf32_Nhdr:
n_namesz: 4 ("CORE")
n_descsz: 80
n_type: 4 (NT_TASKSTRUCT)
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
crash version 4.0-2.7 is available
by Dave Anderson
4.0-2.7 changelog entry:
Fixed x86_64 backtrace code to recognize 32-bit user code kernel
entry exception frames (code segment selectors of 0x23) without
issuing a "bt: WARNING: possibly bogus exception frame" message.
Fixed x86_64 backtrace code to recognize in-kernel exception
frames generated from module text in situations where the module
data was not included in the dumpfile, such as in a netdump which
resulted in a vmcore-incomplete file.
(10/19/05)
diskdump/x86_64 crash compatibility.
by Neil Horman
Hey all-
If I have a RHEL4 crash dump taken from an AMD x86_64 machine, should I
be able to examine that vmcore using crash-4.0.1 on a RHEL4 box running on EM64T
hardware? I would have assumed that I would be able to, but I keep getting file
format errors when trying to read the core.
Neil
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*nhorman(a)redhat.com
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/
Michael Holzheu/Germany/IBM is out of the office.
by Michael Holzheu
I will be out of the office starting 08/17/2005 and will not return until
03/01/2006.
I am away for six months and will not be reading my mail during this time. For
technical questions please contact Raimund Schroeder
(Raimund.Schroeder(a)de.ibm.com).