Worth, Kevin wrote:
Tried running crash on a running kernel... seems that 4.0-3.7
doesn't like my kernel. When I run crash 4.0-7.2 on a live system, it appears that it
has no problems with vmalloc'd module memory.
crash 4.0-3.7
...
GNU gdb 6.1
...
This GDB was configured as "i686-pc-linux-gnu"...
crash: /boot/System.map-2.6.20-17.39-custom2 and /dev/mem do not match!
Usage:
crash [-h [opt]][-v][-s][-i file][-d num] [-S] [mapfile] [namelist] [dumpfile]
Enter "crash -h" for details.
crash 4.0-7.2
...
GNU gdb 6.1
...
This GDB was configured as "i686-pc-linux-gnu"...
KERNEL: vmlinux-2.6.20-17.39-custom2
DUMPFILE: /dev/mem
CPUS: 2
DATE: Wed Oct 1 16:31:39 2008
UPTIME: 04:57:53
LOAD AVERAGE: 0.10, 0.09, 0.09
TASKS: 95
NODENAME: ProCurve-TMS-zl-Module
RELEASE: 2.6.20-17.39-custom2
VERSION: #3 SMP Wed Sep 24 10:11:03 PDT 2008
MACHINE: i686 (2200 Mhz)
MEMORY: 5 GB
PID: 15801
COMMAND: "crash"
TASK: 47bd6030 [THREAD_INFO: 4a8a8000]
CPU: 1
STATE: TASK_RUNNING (ACTIVE)
crash>
Since that seems OK (and I don't encounter the error), I'll run crash with -d7 on
the dump file to hopefully expose what is wrong with either the dump or with crash.
I've attached the output of crash with -d7... not sure how the mailing list handles
file attachments, but if needed I can paste the text (or if there is something specific I
should look for, let me know and I can paste just that section).
Yeah, crash 4.0-3.7 is 2 years old, which is pretty ancient.
Plus I'm only interested in helping out with the latest version.
But according to the above, 4.0-7.2 works OK on the live system?
You can do a "mod" command and it works OK?
Sometimes on larger-memory systems, running live using /dev/mem,
you might see the "WARNING: cannot access vmalloc'd module"
message because the physical memory that is backing the
vmalloc'd virtual address is in highmem, and cannot be
accessed by /dev/mem. In any case, it appears that the
module structures have all been read successfully on your
live system.
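For a rough sense of why highmem matters here: on a stock i386 kernel the direct-mapped "lowmem" region covers only about the first 896 MB of RAM, and /dev/mem generally can't reach beyond it. A back-of-the-envelope sketch, where the 896 MB figure is the conventional default and not something taken from your actual config:

```python
# Back-of-the-envelope only: 896 MB is the conventional i386
# lowmem limit; the real boundary depends on the kernel config.
LOWMEM_MB = 896
total_mb = 5 * 1024              # the 5 GB reported above

highmem_mb = total_mb - LOWMEM_MB
print("%d of %d MB is highmem (%.1f%%)"
      % (highmem_mb, total_mb, 100.0 * highmem_mb / total_mb))
# -> 4224 of 5120 MB is highmem (82.5%)
```

So on a 5 GB i686 box, the large majority of physical pages are in highmem, which is why vmalloc'd module pages can be unreadable through /dev/mem.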
And that's kind of bothersome, because for all practical
purposes, the crash utility doesn't care where it's getting
the physical memory from (i.e., from /dev/mem or from the
dumpfile). And if it works on the live system, it should
work with the dumpfile.
Anyway, looking at the crash.log, here's what's happening:
Everything was running fine until the module initialization
step. The list of installed kernel modules is headed up
from the "modules" list_head symbol at 403c63a4, which
contains a pointer to the first module structure at
vmalloc address f9088280:
...
<readmem: 403c63a4, KVADDR, "modules", 4, (FOE), 83ff8cc>
please wait... (gathering module symbol data)
module: f9088280
The readmem() of that first module -- and the very first vmalloc
address -- at f9088280 required a page table translation:
<readmem: f9088280, KVADDR, "module struct", 1536, (ROE|Q), 842a5e0>
<readmem: 4044b000, KVADDR, "pgd page", 32, (FOE), 845a308>
<readmem: 6000, PHYSADDR, "pmd page", 4096, (FOE), 845b310>
<readmem: 1d515000, PHYSADDR, "page table", 4096, (FOE), 845c318>
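For what it's worth, those three nested reads are consistent with a PAE-style walk (the 32-byte "pgd page" read is exactly the four 8-byte PAE pgd entries). A rough sketch of how a 32-bit virtual address splits into indices under PAE's 2/9/9/12 layout; this is generic x86 arithmetic, not code lifted from crash itself:

```python
# Sketch of a 3-level PAE address split (2/9/9/12 bits); generic
# x86 PAE layout, not crash's actual translation code.
def pae_split(vaddr):
    return {
        "pgd_index": vaddr >> 30,            # 4 pgd entries, 8 bytes each
        "pmd_index": (vaddr >> 21) & 0x1ff,  # 512 pmd entries per pmd page
        "pte_index": (vaddr >> 12) & 0x1ff,  # 512 ptes per page table
        "offset":    vaddr & 0xfff,          # byte offset within the page
    }

print(pae_split(0xf9088280))
```

Running that on the module address f9088280 shows which entry is consulted in each of the pages that the log above reads.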
That readmem() appears to have worked, because it thinks it
successfully read the module struct at that address. But when
it pulled out the address of the *next* module in the linked list,
it read this:
module: fffffffc
And when it tried to read that bogus address, it failed, and
led to the WARNING message:
<readmem: fffffffc, KVADDR, "module struct", 1536, (ROE|Q), 842a5e0>
<readmem: 7000, PHYSADDR, "page table", 4096, (FOE), 845c318>
crash: invalid kernel virtual address: fffffffc type: "module struct"
WARNING: cannot access vmalloc'd module memory
...
Although I cannot say for sure, I'm presuming that the initial
read of the module structure at f9088280 ended up reading from
the wrong location and therefore read garbage. You can verify
that by bringing up a dumpfile session and doing this:
crash> module f9088280
It *should* display something that is recognizable as a module
structure. For example:
crash> mod | grep ext3
f8899080 ext3 123593 (not loaded) [CONFIG_KALLSYMS]
crash> module f8899080
struct module {
  state = MODULE_STATE_LIVE,
  list = {
    next = 0xf8854a84,
    prev = 0xf8876984
  },
  name = "ext3",
  mkobj = {
    kobj = {
      k_name = 0xf88990cc "ext3",
      name = "ext3",
      kref = {
        refcount = {
          counter = 2
        }
      },
      ...
Your attempt will probably show the fffffffc in the list_head
just after the "state" field at the top, as well as a bunch
of other garbage.
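One plausible explanation for the fffffffc value (speculation on my part): crash follows the list by reading module.list.next and subtracting the offset of the "list" member to get back to the containing module struct. If the garbage read returned a next pointer of 0, and "list" sits at offset 4 (right after the 4-byte "state" enum), the result is 0 - 4 = fffffffc. A sketch of that container_of-style arithmetic, where the offset-4 layout is an assumption about this struct module:

```python
# container_of-style arithmetic on a 32-bit kernel. The offset of
# the embedded list_head (4, right after the 4-byte "state" enum)
# is an assumption about this particular struct module layout.
LIST_OFFSET = 4
MASK32 = 0xffffffff

def next_module(next_ptr):
    # Address of the next struct module = next pointer minus the
    # offset of its embedded list_head, truncated to 32 bits.
    return (next_ptr - LIST_OFFSET) & MASK32

print("%x" % next_module(0))   # a zeroed/garbage next pointer -> fffffffc
```

That would fit a read that returned zeroes (or garbage) instead of the real module struct.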
And as I suggested in my first reply, can you also verify that
user virtual address translations also fail? I suggested pulling
a sample virtual address out of the current context's ("bash")
VM, but doing that may "look" like it's working while actually
doing it incorrectly. So you also need to verify the data
that it finds there. One way to do that is to read the beginning
of the /bin/bash text segment and look for the "ELF" string.
For example, here I'm in a "bash" context, similar to the
context that your dumpfile comes up in by default:
crash> set
PID: 19839
COMMAND: "bash"
TASK: f7b03000 [THREAD_INFO: def66000]
CPU: 1
STATE: TASK_INTERRUPTIBLE
crash>
Dump the virtual memory regions, and find the first VMA
that is backed by "/bin/bash":
crash> vm
PID: 19839 TASK: f7b03000 CPU: 1 COMMAND: "bash"
MM PGD RSS TOTAL_VM
f6dc5740 f745c9c0 1392k 4532k
VMA START END FLAGS FILE
f69019bc 6fa000 703000 75 /lib/libnss_files-2.5.so
f69013e4 703000 704000 100071 /lib/libnss_files-2.5.so
f6901d84 704000 705000 100073 /lib/libnss_files-2.5.so
f6901284 a7c000 a96000 875 /lib/ld-2.5.so
f6901b74 a96000 a97000 100871 /lib/ld-2.5.so
f6901b1c a97000 a98000 100873 /lib/ld-2.5.so
f69012dc a9a000 bd7000 75 /lib/libc-2.5.so
f690185c bd7000 bd9000 100071 /lib/libc-2.5.so
f6901ac4 bd9000 bda000 100073 /lib/libc-2.5.so
f69017ac bda000 bdd000 100073
f6901e8c bdf000 be1000 75 /lib/libdl-2.5.so
f6901a6c be1000 be2000 100071 /lib/libdl-2.5.so
f6901754 be2000 be3000 100073 /lib/libdl-2.5.so
f6901f94 c89000 c8c000 75 /lib/libtermcap.so.2.0.8
f69016fc c8c000 c8d000 100073 /lib/libtermcap.so.2.0.8
f6901d2c fd1000 fd2000 8000075
f6901124 8047000 80f5000 1875 /bin/bash
f69018b4 80f5000 80fa000 101873 /bin/bash
f6901964 80fa000 80ff000 100073
f690122c 9a75000 9a96000 100073
f680890c b7d7f000 b7f7f000 71 /usr/lib/locale/locale-archive
f6901f3c b7f7f000 b7f81000 100073
f68cfb74 b7f82000 b7f84000 100073
f6dd69bc b7f84000 b7f8b000 d1 /usr/lib/gconv/gconv-modules.cache
f69014ec bf86e000 bf884000 100173
crash>
You can see above that, in my case, the text region starts at
user virtual address 8047000. That actually points to the
ELF header at the beginning of the "/bin/bash" file, which
starts with a 0x7f followed by the ascii "ELF" characters:
crash> rd 8047000
8047000: 464c457f .ELF
crash>
You might want to use "rd -u <address>" to ensure that
crash will presume that the address is a user address,
just in case that's an issue with your setup.
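As a sanity check on what "rd" printed: 464c457f is simply the first four bytes of the file read back as a little-endian 32-bit word. A quick sketch of that decoding:

```python
import struct

# Decode the 32-bit word that "rd" displayed; little-endian, as on i686.
word = 0x464c457f
magic = struct.pack("<I", word)   # bytes in memory order

print(magic)                      # the ELF magic: 0x7f followed by "ELF"
assert magic == b"\x7fELF"
```

If your dumpfile session shows anything other than 464c457f at the start of the /bin/bash text region, the user-space translation is reading from the wrong place.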
Anyway, try the above, and also dump out and save
the output of these debug commands:
crash> help -m > help.m
crash> help -k > help.k
crash> help -v > help.v
But again, given that you seem to be saying that everything
works just fine on the live system, the debugging of this
issue will most likely end up requiring that you determine
where exactly things "go wrong" with the dumpfile in comparison
to the same things working correctly on the live system.
Thanks,
Dave