Re: [Crash-utility] [PATCH] SIAL extension: bitfield handling fixup
by Dave Anderson
Thanks Hedi,
Forwarding to Luc Chouinard for his ACK.
Dave
----- "Hedi Berriche" <hedi(a)sgi.com> wrote:
> This is a patch from Olaf Weber that fixes a bug in the SIAL extension when
> dealing with bitfields.
>
> Cheers,
> Hedi.
>
> Signed-off-by: Olaf Weber <olaf(a)sgi.com>
> Tested-by: Hedi Berriche <hedi(a)sgi.com>
>
> --- a/extensions/sial.c 2008-10-14 14:35:40.000000000 +0100
> +++ b/extensions/sial.c 2008-10-21 13:16:39.024057612 +0100
> @@ -363,7 +363,7 @@ int midx;
> sial_member_soffset(m, TYPE_FIELD_BITPOS(type, midx)/8);
> sial_member_ssize(m, TYPE_FIELD_TYPE(type, midx)->length);
> sial_member_snbits(m, TYPE_FIELD_BITSIZE(type, midx));
> - sial_member_sfbit(m, TYPE_FIELD_BITSIZE(type, midx));
> + sial_member_sfbit(m, TYPE_FIELD_BITPOS(type, midx)%8);
> sial_member_sname(m, TYPE_FIELD_NAME(type, midx));
> LASTNUM=midx+1;
> return drilldowntype(TYPE_FIELD_TYPE(type, midx), tm);
> --- a/extensions/libsial/sial_type.c 2008-10-14 14:35:40.000000000 +0100
> +++ b/extensions/libsial/sial_type.c 2008-10-21 13:16:48.823058008 +0100
> @@ -278,9 +278,6 @@ get_bit_value(ull val, int nbits, int bo
> int dosign=0;
> int vnbits=size*8;
>
> -
> - val = API_GET_UINT64(&val);
> -
> /* first get the value_t */
> if (nbits >= 32) {
> int upper_bits = nbits - 32;
>
> --
> Hedi Berriche
> Global Product Support
>
16 years, 2 months
[PATCH] Fix for "files" command on 2.6.25 or later kernels
by Dave Anderson
Fix for the "files" command when run on 2.6.25 and later kernels,
which either fails with an "invalid kernel virtual address" error
of type "fill_dentry_cache", or shows nonsensical/garbage "ROOT"
and "CWD" directory pathnames. This was due to the change
in format of the kernel's fs_struct.
Queued for the next release.
[PATCH] SIAL extension: bitfield handling fixup
by Hedi Berriche
This is a patch from Olaf Weber that fixes a bug in the SIAL extension when
dealing with bitfields.
Cheers,
Hedi.
Signed-off-by: Olaf Weber <olaf(a)sgi.com>
Tested-by: Hedi Berriche <hedi(a)sgi.com>
--- a/extensions/sial.c 2008-10-14 14:35:40.000000000 +0100
+++ b/extensions/sial.c 2008-10-21 13:16:39.024057612 +0100
@@ -363,7 +363,7 @@ int midx;
sial_member_soffset(m, TYPE_FIELD_BITPOS(type, midx)/8);
sial_member_ssize(m, TYPE_FIELD_TYPE(type, midx)->length);
sial_member_snbits(m, TYPE_FIELD_BITSIZE(type, midx));
- sial_member_sfbit(m, TYPE_FIELD_BITSIZE(type, midx));
+ sial_member_sfbit(m, TYPE_FIELD_BITPOS(type, midx)%8);
sial_member_sname(m, TYPE_FIELD_NAME(type, midx));
LASTNUM=midx+1;
return drilldowntype(TYPE_FIELD_TYPE(type, midx), tm);
--- a/extensions/libsial/sial_type.c 2008-10-14 14:35:40.000000000 +0100
+++ b/extensions/libsial/sial_type.c 2008-10-21 13:16:48.823058008 +0100
@@ -278,9 +278,6 @@ get_bit_value(ull val, int nbits, int bo
int dosign=0;
int vnbits=size*8;
-
- val = API_GET_UINT64(&val);
-
/* first get the value_t */
if (nbits >= 32) {
int upper_bits = nbits - 32;
--
Hedi Berriche
Global Product Support
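The first hunk changes what gets stored as a member's first bit: the old code passed the field's width (TYPE_FIELD_BITSIZE) where the bit offset within the starting byte belongs. Here is a minimal C sketch of the arithmetic the fixed line performs, assuming little-endian bit numbering. `struct bitfield_loc`, `locate_bitfield()` and `extract_bitfield()` are hypothetical names, not part of SIAL or crash:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not SIAL/crash code): split a GDB-style bit position
 * into the byte offset and first-bit values that the patched code passes to
 * sial_member_soffset() and sial_member_sfbit(). */
struct bitfield_loc {
    unsigned int byte_offset; /* bitpos / 8 */
    unsigned int first_bit;   /* bitpos % 8, the value the fix stores */
    unsigned int nbits;       /* field width in bits */
};

static struct bitfield_loc locate_bitfield(unsigned int bitpos,
                                           unsigned int bitsize)
{
    struct bitfield_loc loc = { bitpos / 8, bitpos % 8, bitsize };
    return loc;
}

/* Extract a field from a raw byte buffer, assuming little-endian bit
 * numbering (bit 0 is the least significant bit of byte 0). */
static uint64_t extract_bitfield(const uint8_t *buf, struct bitfield_loc loc)
{
    uint64_t word = 0;
    unsigned int nbytes;

    /* Keep the whole field inside one 64-bit accumulator. */
    assert(loc.nbits > 0 && loc.first_bit + loc.nbits <= 64);
    nbytes = (loc.first_bit + loc.nbits + 7) / 8;
    for (unsigned int i = 0; i < nbytes; i++)
        word |= (uint64_t)buf[loc.byte_offset + i] << (8 * i);
    word >>= loc.first_bit;
    if (loc.nbits < 64)
        word &= ((uint64_t)1 << loc.nbits) - 1;
    return word;
}
```

With a byte value of 0x58, a 4-bit field at bit position 3 decodes to 11 using the fixed first bit, but to 5 if the first bit is wrongly taken from the field width, which is the kind of silently wrong value the patch addresses.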
Re: [Crash-utility] kmem -[sS] segfault on 2.6.25.17
by Dave Anderson
----- "Mike Snitzer" <snitzer(a)gmail.com> wrote:
>
> Anyway, in the end when I use your patched crash: Things look very nice!
>
OK great, thanks. I'm running some tests on a set of saved dumpfiles
and if I don't see any problems, I'll queue it for the next release.
Dave
Re: [Crash-utility] kmem -[sS] segfault on 2.6.25.17
by Dave Anderson
Mike,
Apply this patch to 4.0-7.4 and run it on your live system.
It works with the sample vmcore you sent me, and it should also alleviate
many of the errors that I suggested might be due to the underlying
shifting sands of your live system.
I don't know why this is the first report of this, nor why I've never
seen it before. The problem has to do with using invalid entries
in the kmem_cache.nodelists[MAX_NUMNODES] array of kmem_list3 data
structures. To date, the unused entries have typically been NULL
for non-existent memory nodes, but not so with your kernel.
Let me know how it works for you.
Thanks,
Dave
Re: [Crash-utility] kmem -[sS] segfault on 2.6.25.17
by Dave Anderson
> ----- "Mike Snitzer" <snitzer(a)gmail.com> wrote:
> > BTW, if need be, would you be able to make the vmlinux/vmcore pair
> > available for download somewhere? (You can contact me off-list with
> > the particulars...)
>
> I can work to make that happen if needed...
OK, I've got your vmlinux/vmcore pair, and right away I can see
what the problem is. The BZERO/memset runs into the weeds here:
BZERO(si->cpudata[i], sizeof(ulong) * vt->kmem_max_limit);
Running "help -v" dumps the crash-internal VM-related table, and the
initialization-time determination of vt->kmem_max_limit is absurd
(-32512, or ffffffffffff8100):
crash> help -v
flags: 5c52
(NODES_ONLINE|ZONES|PERCPU_KMALLOC_V2|KMEM_CACHE_INIT|SPARSEMEM|SPARSEMEM_EX|PERCPU_KMALLOC_V2_NODES)
kernel_pgd[NR_CPUS]: ffffffff80201000 ...
high_memory: ffff81007fb50000
vmalloc_start: ffffc20000000000
mem_map: 0
total_pages: 523088
max_mapnr: 0
totalram_pages: 479819
totalhigh_pages: 0
num_physpages: 523088
page_hash_table: 0
page_hash_table_len: 0
kmem_max_c_num: 92
kmem_max_limit: -32512
kmem_max_cpus: 4
kmem_cache_count: 184
...
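To see why a negative limit sends the BZERO into the weeds: the byte count is computed as sizeof(ulong) times the limit, so on a 64-bit build a negative limit is converted to an enormous unsigned length. This is a small sketch using fixed-width types, assuming kmem_max_limit is held in a signed 64-bit field. `bzero_length()` and the guard `kmem_max_limit_is_sane()` (with its caller-supplied cap) are hypothetical, not crash code:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror the length argument of
 *     BZERO(si->cpudata[i], sizeof(ulong) * vt->kmem_max_limit);
 * on a 64-bit build. Converting a negative value to uint64_t wraps modulo
 * 2^64, just as the implicit conversion to unsigned long does on LP64. */
static uint64_t bzero_length(int64_t kmem_max_limit)
{
    return (uint64_t)sizeof(uint64_t) * (uint64_t)kmem_max_limit;
}

/* Hypothetical guard: accept only a positive limit no larger than cap. */
static int kmem_max_limit_is_sane(int64_t kmem_max_limit, int64_t cap)
{
    assert(cap > 0);
    return kmem_max_limit > 0 && kmem_max_limit <= cap;
}
```

With -32512 (ffffffffffff8100) the length comes out as 0xfffffffffffc0800 bytes, while the normal value of 128 reported elsewhere in this thread gives 1024.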
When you run this same kernel on the live system, what do you see
when you enter "help -v"?
Dave
Re: [Crash-utility] "cannot access vmalloc'd module memory" when loading kdump'ed vmcore in crash
by Dave Anderson
----- "Kevin Worth" <kevin.worth(a)hp.com> wrote:
> Hi Dave,
>
> Thanks for all your help with the crash/kexec troubleshooting, I
> really appreciate it. At this point I’m going toward the
> kernel/kexec/kdump side of things and am trying to do whatever
> research I can before sending another mail to kexec-ml. I tried to see
> if the issue still occurs with the Ubuntu “generic” kernel, which has
> HIGHMEM4G instead of HIGHMEM64G, and the standard VMSPLIT/PAGE_OFFSET
> setting. I also noted that enabling HIGHMEM64G also implicitly seems
> to select CONFIG_RESOURCES_64BIT=y which was marked Experimental in
> 2.6.20 (hmmm).
>
> Interestingly enough, when I use the Ubuntu "generic" kernel
> (HIGHMEM4G + normal VMSPLIT), the crash dump works just fine and I can
> view the modules, etc. This leads me to believe that it's either the
> HIGHMEM64G or the modified VMSPLIT that is causing the problem (or
> both in combination). I'm going to try compiling kernels with only one
> setting changed in each to try to isolate the issue.
>
> Warning: probably a stupid question-- My machine has 4GB of memory. If
> I boot the "generic" kernel with HIGHMEM4G, only 3GB is recognized
> (according to my understanding of free and /proc/meminfo results).
> However, when I create a dump and load it up in crash, I see
> "MEMORY: 4 GB". Am I just misunderstanding the output of free/meminfo or is
> this just crash calculating something differently?
It's the same as my answer about how the "MEMORY: 5GB" was reported
on your earlier dumpfile. If you run with the "-d 1" command line
option, check out the crash-internal "node_table[0]" dump, and multiply
the "present" value times 4K (PAGE_SIZE).
When you run on the live system with "-d 1", the "node_table[0]"
output will be the same as what you see with the dumpfile.
BTW, that presumes that, with your new kernel configuration,
your kernel still creates just one node. If by chance there
is more than one node, you'll see an additional node_table[1] dump,
etc., and the MEMORY: display is the sum total of all the "present"
pages.
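The MEMORY: arithmetic described above, sketched in C: sum the "present" page counts across all node_table[] entries and multiply by the page size. The 4K page size matches this i686 case. `memory_bytes()` is a hypothetical helper, and the page counts in the note below are illustrative, not taken from Kevin's dump:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumption: 4K pages, as on this i686 box */

/* Add up the "present" page counts of every node_table[] entry and convert
 * the total to bytes. */
static uint64_t memory_bytes(const uint64_t *present, size_t nnodes)
{
    uint64_t pages = 0;

    assert(present != NULL || nnodes == 0);
    for (size_t i = 0; i < nnodes; i++)
        pages += present[i];
    return pages * SKETCH_PAGE_SIZE;
}
```

For example, a single node with 1048576 present pages yields 4GB, and two nodes with 786432 and 524288 present pages together yield 5GB.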
Dave
Re: [Crash-utility] "cannot access vmalloc'd module memory" when loading kdump'ed vmcore in crash
by Dave Anderson
----- "Kevin Worth" <kevin.worth(a)hp.com> wrote:
> Hi Dave,
>
> Before you responded I noticed that a simple "make modules" didn't
> work because my kernel wasn't exporting the symbol. Rather than do
> anything risky/complex which might risk mucking up the troubleshooting
> process, I just rebuilt the kernel. It built just fine and now I can
> load crash and I see "DUMPFILE: /dev/crash" when I load up crash. Let
> me try walking through the steps that you had me do previously, this
> time using /dev/crash instead of /dev/mem and /dev/kmem
You made one small (though not fatal) error in the suggested steps.
See my comments below...
>
> From my limited understanding of what's going on here, it would
> appear that the dump file is missing some data, or else crash is
> looking in the wrong place for it.
The crash utility is a slave to what is indicated in the PT_LOAD
segments of the ELF header of the kdump vmcore. In the case of
the physical memory chunk that starts at 4GB physical on your machine,
this is what's in the ELF header (from your original "crash.log" file):
Elf64_Phdr:
p_type: 1 (PT_LOAD)
p_offset: 3144876760 (bb7302d8)
p_vaddr: ffffffffffffffff
p_paddr: 100000000
p_filesz: 1073741824 (40000000)
p_memsz: 1073741824 (40000000)
p_flags: 7 (PF_X|PF_W|PF_R)
p_align: 0
What that says is: for the range of physical memory starting
at 0x100000000 (p_paddr), the vmcore contains a block of
memory starting at file offset (p_offset) 3144876760/0xbb7302d8
that is 1073741824/0x40000000 (p_filesz) bytes long.
More simply put, the 1GB of physical memory from 4GB to 5GB
can be found in the vmcore file starting at file offset 3144876760.
So if a request for physical memory page 0x100000000 comes
in, the crash utility reads from vmcore file offset 3144876760.
If the next physical page were requested, i.e., at 0x100001000,
it would read from vmcore file offset 3144876760+4096. It's
as simple as that -- so when you suggest that "crash is looking
in the wrong place for it", well, there's nothing that the
crash utility can do differently.
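The same translation as a worked C sketch, using only the values from the PT_LOAD header above. `vmcore_offset()` is a hypothetical single-segment helper:

```c
#include <stdint.h>

/* Values from the PT_LOAD program header quoted above. */
#define SEG_P_OFFSET 0xbb7302d8ULL    /* 3144876760 */
#define SEG_P_PADDR  0x100000000ULL   /* 4GB */
#define SEG_P_FILESZ 0x40000000ULL    /* 1GB */

/* Return the vmcore file offset that holds physical address paddr, or -1
 * if paddr falls outside this single segment. */
static int64_t vmcore_offset(uint64_t paddr)
{
    if (paddr < SEG_P_PADDR || paddr >= SEG_P_PADDR + SEG_P_FILESZ)
        return -1;
    return (int64_t)(paddr - SEG_P_PADDR + SEG_P_OFFSET);
}
```

Under this mapping, physical 0x100001000 lands at file offset 3144880856 (3144876760+4096), and the module structure's physical address 116017a00 lands at file offset 3514072280 (0xd1747cd8).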
Now, back to the test sequence:
> ---Live system---
>
> KERNEL: vmlinux-devcrash
> DUMPFILE: /dev/crash
> CPUS: 2
> DATE: Tue Oct 14 16:08:28 2008
> UPTIME: 00:02:07
> LOAD AVERAGE: 0.17, 0.08, 0.03
> TASKS: 97
> NODENAME: test-machine
> RELEASE: 2.6.20-17.39-custom2
> VERSION: #1 SMP Tue Oct 14 13:45:17 PDT 2008
> MACHINE: i686 (2200 Mhz)
> MEMORY: 5 GB
> PID: 5628
> COMMAND: "crash"
> TASK: 5d4c2560 [THREAD_INFO: f3de6000]
> CPU: 1
> STATE: TASK_RUNNING (ACTIVE)
>
> crash> p modules
> modules = $2 = {
> next = 0xf8a3ea04,
> prev = 0xf8842104
> }
>
> crash> module 0xf8a3ea00
> struct module {
> state = MODULE_STATE_LIVE,
> list = {
> next = 0xf8d10484,
> prev = 0x403c63a4
> },
> name =
> "crash\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\
> 000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\
> 000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
> mkobj = {
> kobj = {
> k_name = 0xf8a3ea4c "crash",
> name =
> "crash\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
> kref = {
> refcount = {
> counter = 3
> }
> },
> entry = {
> next = 0x403c6068,
> prev = 0xf8d104e4
> },
> parent = 0x403c6074
> ...
>
> crash> vtop 0xf8a3ea00
> VIRTUAL PHYSICAL
> f8a3ea00 116017a00
OK -- so the physical memory location of the module data structure
is at physical address 116017a00, but...
>
> PAGE DIRECTORY: 4044b000
> PGD: 4044b018 => 6001
> PMD: 6e28 => 1d51a067
> PTE: 1d51a1f0 => 116017163
> PAGE: 116017000
>
> PTE PHYSICAL FLAGS
> 116017163 116017000 (PRESENT|RW|ACCESSED|DIRTY|GLOBAL)
>
> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> 472c02e0 116017000 0 229173 1 80000000
>
You're reading from the beginning of the page, i.e., 116017000
instead of where the module structure is at 116017a00:
> crash> rd -p 116017000 30
> 116017000: 53e58955 d089c389 4d8bca89 74c98508 U..S.......M...t
> 116017010: 01e9831f b85b0d74 ffffffea ffffba5d ....t.[.....]...
> 116017020: 03c3ffff 53132043 26b48d24 00000000 ....C .S$..&....
> 116017030: 89204389 5d5b2453 26b48dc3 00000000 .C .S$[]...&....
> 116017040: 83e58955 55892cec 08558be4 89f45d89 U....,.U..U..]..
> 116017050: 7d89f875 ffeabffc 4d89ffff 8b028be0 u..}.......M....
> 116017060: c3890452 ac0fd689 45890cf3 0ceec1ec R..........E....
> 116017070: 5589c889 89d231f0 ...U.1..
> crash>
>
That is why you're not seeing the "crash" strings embedded in
the raw physical data. Now, although it would have been "nice"
if you could have shown the contents of the module structure via
the physical address, the fact remains that since you used the
/dev/crash driver, the "module 0xf8a3ea00" command required that
the crash utility first translate the vmalloc address into its
physical equivalent, and then read from there.
In any case, you do have a dump of physical memory from 116017000
which at least is in the same 4k page as the module data structure,
so it should not change when read from the dumpfile.
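The page arithmetic behind that observation, assuming 4K pages and the 4-byte default unit visible in the i686 "rd -p" output above. The helper names are hypothetical:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 0x1000ULL   /* assumption: 4K pages */

/* Start of the page containing paddr. */
static uint64_t page_base(uint64_t paddr)
{
    return paddr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Offset of paddr within its page. */
static uint64_t page_offset(uint64_t paddr)
{
    return paddr & (SKETCH_PAGE_SIZE - 1);
}

/* First address past an "rd -p start count" read, assuming 4-byte units. */
static uint64_t rd_end(uint64_t start, uint64_t count)
{
    return start + count * 4;
}
```

The 30-word read covers 0x116017000 up to (but not including) 0x116017078, well short of offset 0xa00 within the page, where the module structure starts.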
> ---Using dump file---
>
>
> please wait... (gathering module symbol data)
> WARNING: cannot access vmalloc'd module memory
>
> KERNEL: vmlinux-devcrash
> DUMPFILE: /var/crash/vmcore
> CPUS: 2
> DATE: Tue Oct 14 16:09:32 2008
> UPTIME: 00:03:12
> LOAD AVERAGE: 0.09, 0.08, 0.02
> TASKS: 97
> NODENAME: test-machine
> RELEASE: 2.6.20-17.39-custom2
> VERSION: #1 SMP Tue Oct 14 13:45:17 PDT 2008
> MACHINE: i686 (2200 Mhz)
> MEMORY: 5 GB
> PANIC: "[ 192.148000] SysRq : Trigger a crashdump"
> PID: 0
> COMMAND: "swapper"
> TASK: 403c0440 (1 of 2) [THREAD_INFO: 403f2000]
> CPU: 0
> STATE: TASK_RUNNING (SYSRQ)
>
> crash> p modules
> modules = $2 = {
> next = 0xf8a3ea04,
> prev = 0xf8842104
> }
>
> crash> module 0xf8a3ea00
> struct module {
> state = MODULE_STATE_LIVE,
> list = {
> next = 0x0,
> prev = 0x0
> },
> name =
> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\0
> 00\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\0
> 00\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\0
> 00\000",
> mkobj = {
> kobj = {
> k_name = 0x0,
> name =
> "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\0
> 00\000\000",
> kref = {
> refcount = {
> counter = 0
> }
> },
> entry = {
> next = 0x0,
> prev = 0x0
> ...
>
> crash> vtop 0xf8a3ea00
> VIRTUAL PHYSICAL
> f8a3ea00 116017a00
>
> PAGE DIRECTORY: 4044b000
> PGD: 4044b018 => 6001
> PMD: 6e28 => 1d51a067
> PTE: 1d51a1f0 => 116017163
> PAGE: 116017000
>
> PTE PHYSICAL FLAGS
> 116017163 116017000 (PRESENT|RW|ACCESSED|DIRTY|GLOBAL)
>
> PAGE PHYSICAL MAPPING INDEX CNT FLAGS
> 472c02e0 116017000 0 229173 1 80000000
>
> crash> rd -p 116017000 30
> 116017000: 00000000 00000000 00000000 00000000 ................
> 116017010: 00000000 00000000 00000000 00000000 ................
> 116017020: 00000000 00000000 00000000 00000000 ................
> 116017030: 00000000 00000000 00000000 00000000 ................
> 116017040: 00000000 00000000 00000000 00000000 ................
> 116017050: 00000000 00000000 00000000 00000000 ................
> 116017060: 00000000 00000000 00000000 00000000 ................
> 116017070: 00000000 00000000 ........
> crash>
Now we're reading the same physical address as you did on
the live system, and it's returning all zeroes. And the
"module 0xf8a3ea00" above shows all zeroes from a higher
location in the page because the same vmalloc translation is
done to turn it into a physical address before reading it
from the vmcore file. But instead of using the /dev/crash driver
to access the translated physical memory, the crash utility
uses the information from the ELF header's PT_LOAD segments
to find out where to find the page data in the vmcore file.
So, anyway, the "rd -p 116017000 30" command that you did
on both the live system and the dumpfile should yield the same
data.
It seems like in all examples to date, the file data read
at the greater-than-4GB PT_LOAD segment returns zeroes.
You can verify this from the crash utility's viewpoint by
doing a "help -n" during runtime when running with the dumpfile,
which will show you both the actual contents of the ELF header,
as well as the manner in which the PT_LOAD data is stored for
its use. (It's also shown with the "crash -d7 ..." output).
So again, from your original "crash.log" file, here is what the
ELF header's PT_LOAD segment contains:
Elf64_Phdr:
p_type: 1 (PT_LOAD)
p_offset: 3144876760 (bb7302d8)
p_vaddr: ffffffffffffffff
p_paddr: 100000000
p_filesz: 1073741824 (40000000)
p_memsz: 1073741824 (40000000)
p_flags: 7 (PF_X|PF_W|PF_R)
p_align: 0
And this is what the crash utility stored in its internal
data structure for that particular segment:
pt_load_segment[4]:
file_offset: bb7302d8
phys_start: 100000000
phys_end: 140000000
zero_fill: 0
And when the physical memory read request comes in, it filters
to this part of the crash utility's read_netdump() function in
netdump.c:
        for (i = offset = 0; i < nd->num_pt_load_segments; i++) {
                pls = &nd->pt_load_segments[i];
                if ((paddr >= pls->phys_start) &&
                    (paddr < pls->phys_end)) {
                        offset = (off_t)(paddr - pls->phys_start) +
                                pls->file_offset;
                        break;
                }
                if (pls->zero_fill && (paddr >= pls->phys_end) &&
                    (paddr < pls->zero_fill)) {
                        memset(bufptr, 0, cnt);
                        return cnt;
                }
        }
So for any physical address request between 100000000 and 140000000
(4GB to 5GB), it will calculate the offset to seek to by subtracting
100000000 from the incoming physical address, and adding the difference
to the starting file offset of the whole segment.
So if you wanted to, you could put debug code just prior to the "break" above
that shows the pls->file_offset for a given incoming physical address.
But this code has been in place forever, so it's hard to conceive that
somehow it's not working in the case of this dumpfile. But presuming that
it *does* go to the correct file offset location in the vmcore, and it's
getting bogus data from there, then there's nothing that the crash
utility can do about it.
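For experimenting with the debug-print idea outside of crash, here is a standalone sketch that follows the same order of checks as the read_netdump() excerpt above, including the zero_fill range. The `struct seg` type and `lookup()` are hypothetical stand-ins, not crash's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for crash's per-segment bookkeeping; field names follow
 * the pt_load_segment dump above, but the struct itself is hypothetical. */
struct seg {
    uint64_t file_offset;
    uint64_t phys_start;
    uint64_t phys_end;     /* exclusive */
    uint64_t zero_fill;    /* 0, or end of a zero-filled range after phys_end */
};

enum lookup_result { FOUND, ZERO_FILL, NOT_FOUND };

/* Same order of checks as the read_netdump() excerpt: an address inside
 * [phys_start, phys_end) maps to a file offset; an address in the zero-fill
 * range is reported as zeroes without reading the file. */
static enum lookup_result lookup(const struct seg *segs, int nsegs,
                                 uint64_t paddr, uint64_t *offset, int debug)
{
    assert(offset != NULL);
    for (int i = 0; i < nsegs; i++) {
        const struct seg *s = &segs[i];

        if (paddr >= s->phys_start && paddr < s->phys_end) {
            *offset = paddr - s->phys_start + s->file_offset;
            if (debug)  /* the kind of debug print suggested above */
                fprintf(stderr, "paddr %llx: segment %d, file_offset %llx, "
                        "offset %llx\n", (unsigned long long)paddr, i,
                        (unsigned long long)s->file_offset,
                        (unsigned long long)*offset);
            return FOUND;
        }
        if (s->zero_fill && paddr >= s->phys_end && paddr < s->zero_fill)
            return ZERO_FILL;
    }
    return NOT_FOUND;
}
```

Fed a table containing the pt_load_segment[4] values, a request for 116017a00 resolves to file offset 0xd1747cd8, which is the offset worth checking directly in the vmcore.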
Dave
Re: [Crash-utility] kmem -[sS] segfault on 2.6.25.17
by Dave Anderson
----- "Mike Snitzer" <snitzer(a)gmail.com> wrote:
> Frame 0 of crash's core shows:
> (gdb) bt
> #0 0x0000003b708773e0 in memset () from /lib64/libc.so.6
>
> I'm not sure how to get the faulting address though? Is it just
> 0x0000003b708773e0?
No, that's the text address in memset(). If you "disass memset",
I believe that you'll see that the address above is dereferencing
the rcx register/pointer. So then, if you enter "info registers",
you'll get a register dump, and rcx would be the failing address.
(To reproduce this, I inserted a "0xdeadbeef" into si->cpudata[0]
and saw the 0xdeadbeef in rcx with "info registers".)
Or you can always just put an "fprintf(fp, "...")" debug statement in
that function to display the address it's BZERO'ing. Could get a
little verbose...
>
> > And for sanity's sake, what is the crash utility's
> vm_table.kmem_max_limit
> > equal to, and what architecture are you running on?
>
> Architecture is x86_64.
>
> kmem_max_limit=128, sizeof(ulong)=8; so the memset() should in fact be
> zero'ing all 1024 (0x400) bytes that were allocated.
OK, so that all looks normal...
> So the thing is, now when I run live crash on the 2.6.25.17 devel
> kernel I no longer get a segfault!? It still isn't happy but it's at
> least not segfaulting... very odd.
Not necessarily...
>
> I've not rebooted the system at all either... now when I run 'kmem -s'
> in live crash I see:
>
> CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
> ...
> kmem: nfs_direct_cache: full list: slab: ffff810073503000 bad inuse counter: 5
> kmem: nfs_direct_cache: full list: slab: ffff810073503000 bad inuse counter: 5
> kmem: nfs_direct_cache: partial list: bad slab pointer: 88
> kmem: nfs_direct_cache: full list: bad slab pointer: 98
> kmem: nfs_direct_cache: free list: bad slab pointer: a8
> kmem: nfs_direct_cache: partial list: bad slab pointer: 9f911029d74e35b
> kmem: nfs_direct_cache: full list: bad slab pointer: 6b6b6b6b6b6b6b6b
> kmem: nfs_direct_cache: free list: bad slab pointer: 6b6b6b6b6b6b6b6b
> kmem: nfs_direct_cache: partial list: bad slab pointer: 100000001
> kmem: nfs_direct_cache: full list: bad slab pointer: 100000011
> kmem: nfs_direct_cache: free list: bad slab pointer: 100000021
> ffff810073501600 nfs_direct_cache         192          2        40      2     4k
> ...
> kmem: nfs_write_data: partial list: bad slab pointer: 65676e61725f32
> kmem: nfs_write_data: full list: bad slab pointer: 65676e61725f42
> kmem: nfs_write_data: free list: bad slab pointer: 65676e61725f52
> kmem: nfs_write_data: partial list: bad slab pointer: 74736f705f73666e
> kmem: nfs_write_data: full list: bad slab pointer: 74736f705f73667e
> kmem: nfs_write_data: free list: bad slab pointer: 74736f705f73668e
> ffff81007350a5c0 nfs_write_data           760         36        40      8     4k
> ...
> etc.
Are those warnings happening on *every* slab type? When you run on a
live system, the "shifting sands" of the kernel underneath the crash
utility can cause errors like the above. But at least some/most of
the other slabs' infrastructure should remain stable while the command
runs.
>
> But if I run crash against the vmcore I do get the segfault...
>
When you run it on the vmcore, do you get the segfault immediately?
Or do some slabs display their stats OK, but then when it deals with
one particular slab it generates the segfault?
I mean that it's possible that the target slab was in transition
at the time of the crash, in which case you might see some error
messages like you see on the live system. But it is difficult to
explain why it's dying specifically where it is, even if the slab
was in transition.
That all being said, even if the slab was in transition, obviously
the crash utility should be able to handle it more gracefully...
> > BTW, if need be, would you be able to make the vmlinux/vmcore pair
> > available for download somewhere? (You can contact me off-list with
> > the particulars...)
>
> I can work to make that happen if needed...
FYI, I did try our RHEL5 "debug" kernel (2.6.18 + hellofalotofpatches),
which has both CONFIG_DEBUG_SLAB and CONFIG_DEBUG_SLAB_LEAK turned on,
but I don't see the problem. So unless something obvious turns up,
getting a copy of your vmlinux/vmcore pair may be the only way I can help.
Dave
Re: [Crash-utility] kmem -[sS] segfault on 2.6.25.17
by Dave Anderson
----- "Mike Snitzer" <snitzer(a)gmail.com> wrote:
> On Thu, Oct 16, 2008 at 1:16 PM, Dave Anderson <anderson(a)redhat.com> wrote:
> >
> > ----- "Dave Anderson" <anderson(a)redhat.com> wrote:
> >> Ok, then I can't see off-hand why it would segfault. Prior to this
> >> routine running, si->cpudata[0...i] all get allocated buffers equal
> >> to the size that's being BZERO'd.
> >>
> >> Is si->cpudata[i] NULL or something?
>
> (gdb) p si->cpudata
> $1 = {0xa56400, 0xa56800, 0xa56c00, 0xa57000, 0x0 <repeats 252 times>}
> (gdb) p si->cpudata[0]
> $4 = (ulong *) 0xa56400
OK, so if "i" is 0 at the time, then I don't understand how the
BZERO/memset can segfault while zero'ing out memory starting at
address 0xa56400.
BZERO(si->cpudata[i], sizeof(ulong) * vt->kmem_max_limit);
Even if it over-ran the 0x400 bytes that have been allocated to
si->cpudata[0], it would still harmlessly run into the buffer
that was allocated for si->cpudata[1]. What's the bad address
it's faulting on?
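A quick arithmetic check of that reasoning, taking the buffer addresses from the gdb output above, sizeof(ulong) as 8 on x86_64, and the kmem_max_limit of 128 reported elsewhere in this thread. `bzero_bytes()` and `stays_in_slot()` are hypothetical helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Buffer addresses from the "p si->cpudata" output above. */
static const uint64_t cpudata_addr[4] = { 0xa56400, 0xa56800, 0xa56c00, 0xa57000 };

/* Bytes cleared per buffer: sizeof(ulong) * kmem_max_limit, with
 * sizeof(ulong) taken as 8 (x86_64). */
static uint64_t bzero_bytes(uint64_t kmem_max_limit)
{
    return 8 * kmem_max_limit;
}

/* Non-zero if clearing nbytes at buffer i stays below buffer i + 1. */
static int stays_in_slot(int i, uint64_t nbytes)
{
    assert(i >= 0 && i < 3);
    return cpudata_addr[i] + nbytes <= cpudata_addr[i + 1];
}
```

With these values each of the first three 1024-byte (0x400) clears ends exactly where the next buffer begins, so a sane limit alone would not explain the fault.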
And for sanity's sake, what is the crash utility's vm_table.kmem_max_limit
equal to, and what architecture are you running on?
>
> > Also, can you confirm that you are always using the exact vmlinux
> > that is associated with each vmcore/live-system? I mean you're
> > not using a System.map command line argument, right?
>
> Yes, I'm using the exact vmlinux. Not using any arguments for live
> crash; I am for the vmcore runs but that seems needed given crash's
> [mapfile] [namelist] [dumpfile] argument parsing.
>
> I use a redhat-style kernel rpm build process (with a more advanced
> kernel .spec file); so I have debuginfo packages to match all my
> kernels.
OK cool -- so you know what you're doing. ;-)
BTW, if need be, would you be able to make the vmlinux/vmcore pair
available for download somewhere? (You can contact me off-list with
the particulars...)
Dave