----- "Mike Snitzer" <snitzer(a)gmail.com> wrote:
> Frame 0 of crash's core shows:
> (gdb) bt
> #0 0x0000003b708773e0 in memset () from /lib64/libc.so.6
> I'm not sure how to get the faulting address though? Is it just 0x0000003b708773e0?
No, that's the text address in memset(). If you "disass memset",
I believe that you'll see that the address above is dereferencing
the rcx register/pointer. So then, if you enter "info registers",
you'll get a register dump, and rcx would be the failing address.
(To reproduce this, I inserted a "0xdeadbeef" into si->cpuinfo[0]
and saw the 0xdeadbeef in rcx with "info registers")
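The procedure above, as a hypothetical session against the core file (the instruction, offset, and register contents shown are illustrative; yours will differ):

```
(gdb) disass memset
...
=> 0x0000003b708773e0 <memset+...>:  mov  %rax,(%rcx)   <= faulting instruction writes through rcx
...
(gdb) info registers rcx
rcx            0xdeadbeef   3735928559
```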
Or you can always just put an "fprintf(fp, "...")" debug statement in that function to display the address it's BZERO'ing. Could get a little verbose...
> > And for sanity's sake, what is the crash utility's vm_table.kmem_max_limit equal to, and what architecture are you running on?
> Architecture is x86_64.
> kmem_max_limit=128, sizeof(ulong)=8; so the memset() should in fact be zero'ing all 1024 (0x400) bytes that were allocated.
OK, so that all looks normal...
> So the thing is: now when I run live crash on the 2.6.25.17 devel kernel I no longer get a segfault!? It still isn't happy, but it's at least not segfaulting... very odd.
Not necessarily...
> I've not rebooted the system at all either... now when I run 'kmem -s' in live crash I see:
>
> CACHE            NAME             OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
> ...
> kmem: nfs_direct_cache: full list: slab: ffff810073503000 bad inuse counter: 5
> kmem: nfs_direct_cache: full list: slab: ffff810073503000 bad inuse counter: 5
> kmem: nfs_direct_cache: partial list: bad slab pointer: 88
> kmem: nfs_direct_cache: full list: bad slab pointer: 98
> kmem: nfs_direct_cache: free list: bad slab pointer: a8
> kmem: nfs_direct_cache: partial list: bad slab pointer: 9f911029d74e35b
> kmem: nfs_direct_cache: full list: bad slab pointer: 6b6b6b6b6b6b6b6b
> kmem: nfs_direct_cache: free list: bad slab pointer: 6b6b6b6b6b6b6b6b
> kmem: nfs_direct_cache: partial list: bad slab pointer: 100000001
> kmem: nfs_direct_cache: full list: bad slab pointer: 100000011
> kmem: nfs_direct_cache: free list: bad slab pointer: 100000021
> ffff810073501600 nfs_direct_cache     192          2     40      2     4k
> ...
> kmem: nfs_write_data: partial list: bad slab pointer: 65676e61725f32
> kmem: nfs_write_data: full list: bad slab pointer: 65676e61725f42
> kmem: nfs_write_data: free list: bad slab pointer: 65676e61725f52
> kmem: nfs_write_data: partial list: bad slab pointer: 74736f705f73666e
> kmem: nfs_write_data: full list: bad slab pointer: 74736f705f73667e
> kmem: nfs_write_data: free list: bad slab pointer: 74736f705f73668e
> ffff81007350a5c0 nfs_write_data       760         36     40      8     4k
> ...
> etc.
Are those warnings happening on *every* slab type? When you run on a
live system, the "shifting sands" of the kernel underneath the crash
utility can cause errors like the above. But at least some/most of
the other slabs' infrastructure should remain stable while the command
runs.
> But if I run crash against the vmcore I do get the segfault...
When you run it on the vmcore, do you get the segfault immediately?
Or do some slabs display their stats OK, but then when it deals with
one particular slab it generates the segfault?
I mean that it's possible that the target slab was in transition
at the time of the crash, in which case you might see some error
messages like you see on the live system. But it is difficult to
explain why it's dying specifically where it is, even if the slab
was in transition.
That all being said, even if the slab was in transition, obviously
the crash utility should be able to handle it more gracefully...
> > BTW, if need be, would you be able to make the vmlinux/vmcore pair available for download somewhere? (You can contact me off-list with the particulars...)
> I can work to make that happen if needed...
FYI, I did try our RHEL5 "debug" kernel (2.6.18 + hellofalotofpatches),
which has both CONFIG_DEBUG_SLAB and CONFIG_DEBUG_SLAB_LEAK turned on,
but I don't see the problem. So unless something obvious can be
determined, that may be the only way I can help.
Dave