Re: [Crash-utility] Interpreting bt

Thursday, 24 January 2013

----- Original Message -----
...

 Thank you very much for the info, really helpful and very much
 appreciated. I have a few follow-on questions:

 1. When the page fault occurs, are some of the registers (which might
 contain parameters passed to the offending function) trampled on? If
 yes, is there a document, or would you happen to know, which registers
 (in the worst case) are written to?

 The reason I ask is in my dump below the register - RDI (used to pass
 the first param to ahaann() ) should be zero (to have caused the
 page fault), but it is not.

RDI was originally passed into ahaann() as an argument, and the evidence
shows that it had a value of NULL.  However, it was subsequently needed
as an argument register for the call to ahahtl() at ahaann+37.  So before
being reused, RDI was copied/saved in RBX at ahaann+22.  And then RDI was
overwritten/reused at ahaann+28:

...

 From register dump after panic:
 RBX: 0000000000000000 RDI: ffff88035daef4e0 (I expect this to be zero
 per the disassembly code).

 Reverse disassembly from RIP when panic occurred:
 crash> dis -r ffffffffa06ce48f
 0xffffffffa06ce460 <ahaann>: push %rbp
 0xffffffffa06ce461 <ahaann+1>: mov %rsp,%rbp
 0xffffffffa06ce464 <ahaann+4>: push %r12
 0xffffffffa06ce466 <ahaann+6>: push %rbx
 0xffffffffa06ce467 <ahaann+7>: nopl 0x0(%rax,%rax,1)
 0xffffffffa06ce46c <ahaann+12>: mov $0xffffffffa092c548,%rdx
 0xffffffffa06ce473 <ahaann+19>: movzwl %si,%ecx
 0xffffffffa06ce476 <ahaann+22>: mov %rdi,%rbx <==========
 0xffffffffa06ce479 <ahaann+25>: mov %esi,%r12d
 0xffffffffa06ce47c <ahaann+28>: mov $0xffffffffa092e5f0,%rdi
 0xffffffffa06ce483 <ahaann+35>: xor %esi,%esi
 0xffffffffa06ce485 <ahaann+37>: callq 0xffffffffa06cd860 <ahahtl>
 0xffffffffa06ce48a <ahaann+42>: test %rax,%rax
 0xffffffffa06ce48d <ahaann+45>: jne 0xffffffffa06ce500 <ahaann+160>
 0xffffffffa06ce48f <ahaann+47>: mov (%rbx),%rdi <==========

And so in your case, the page fault was caused by the NULL pointer
in RBX, which was originally passed into the function in RDI.
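
The register flow in the disassembly above can be replayed in a few lines of C. This is only an illustrative sketch, not kernel code: struct regs, callee_clobbers_rdi() and replay_ahaann() are made-up names, and the value that the stand-in callee leaves in RDI is assumed to be the RDI from the register dump. RDI is a caller-saved (scratch) register in the x86_64 System V calling convention, while RBX is callee-saved, which is why the original argument survives in RBX but not in RDI.

```c
#include <stdint.h>

/* Only the two registers that matter for this walk-through. */
struct regs {
    uint64_t rdi;
    uint64_t rbx;
};

/*
 * Hypothetical stand-in for ahahtl().  A callee may leave anything in
 * RDI; the value used here is assumed to be the RDI seen in the dump.
 * RBX is callee-saved, so it is left untouched.
 */
static void callee_clobbers_rdi(struct regs *r)
{
    r->rdi = UINT64_C(0xffff88035daef4e0);
}

/*
 * Replays ahaann+22 through ahaann+47 and returns the address that the
 * faulting instruction at ahaann+47 reads from.
 */
uint64_t replay_ahaann(struct regs *r, uint64_t first_arg)
{
    r->rdi = first_arg;                     /* caller passes arg 1 in RDI */
    r->rbx = r->rdi;                        /* +22: mov %rdi,%rbx         */
    r->rdi = UINT64_C(0xffffffffa092e5f0);  /* +28: mov $0x...,%rdi       */
    callee_clobbers_rdi(r);                 /* +37: callq ahahtl          */
    return r->rbx;                          /* +47: mov (%rbx),%rdi       */
}
```

Calling replay_ahaann() with a first argument of 0 returns 0 as the faulting address, with RBX still 0 and RDI non-zero, which matches the RBX and RDI values in the register dump.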

...
 2. Does Linux (specifically crash) treat access to an invalid address
 or a NULL ptr dereference the same way, as in calling them both a page
 fault? (In one of my past work places, the crash dump was explicit
 in stating when a NULL ptr dereference occurred, and I am wondering
 now if that was due to a customization in crash).

The crash utility doesn't have anything to do with it -- it is simply
trying to reconstruct what happened from what it sees left on the stack.

The kernel will transition to page_fault() on either a NULL pointer
or an invalid address (although sometimes an invalid address will
generate a general protection fault exception if certain bits are
set in the bad address).  
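
The "certain bits" case can be sketched as a canonical-address check. This assumes the usual x86_64 configuration with 48-bit virtual addresses (4-level paging), where bits 63 through 47 must be all zeros or all ones; an access through a non-canonical address raises a general protection fault rather than a page fault. The function name is_canonical_48() is made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging).  An address is
 * canonical when bits 63..47 are all equal, i.e. the 17-bit field
 * obtained by shifting right by 47 is either all zeros or all ones.
 */
bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* bits 63..47, 17 bits wide */

    return top == 0 || top == UINT64_C(0x1ffff);
}
```

With this check, a NULL pointer and an address such as ffff88035daef4e0 are both canonical, so a bad access through either one would arrive in page_fault(), whereas an address like 0000800000000000 is not.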

If you do a "log" command, you will see a string that precedes the
final blurb containing the register dump and backtrace that will
also confirm what kind of exception occurred.  Yours probably
says:

 BUG: unable to handle kernel NULL pointer dereference at (null)

which gets generated here in the kernel's show_fault_oops() function:

        printk(KERN_ALERT "BUG: unable to handle kernel ");
        if (address < PAGE_SIZE)
                printk(KERN_CONT "NULL pointer dereference");
        else
                printk(KERN_CONT "paging request");

        printk(KERN_CONT " at %p\n", (void *) address);
        printk(KERN_ALERT "IP:");
        printk_address(regs->ip, 1);
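
The address test in that snippet can be tried out in user space. The sketch below copies only the comparison; SKETCH_PAGE_SIZE and fault_kind() are made-up names, and the 4096-byte page size is an assumption that matches the usual x86_64 configuration.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as is typical on x86_64. */
#define SKETCH_PAGE_SIZE UINT64_C(4096)

/*
 * Mirrors the comparison in show_fault_oops(): any faulting address
 * inside the first page is reported as a NULL pointer dereference,
 * everything else as a paging request.
 */
const char *fault_kind(uint64_t address)
{
    if (address < SKETCH_PAGE_SIZE)
        return "NULL pointer dereference";
    return "paging request";
}
```

Note that the test is "less than PAGE_SIZE" and not "equal to zero", so reading a structure member at a small offset from a NULL pointer (for example address 0x10) is still reported as a NULL pointer dereference.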

...

 3. Expanding on the meaning of the address in [] at the beginning of each line of the bt

 [addr0] function0 at addr2
 [addr1] function1 at addr2

 addr1 - 8 : starting address of the stack frame from function1 up to
 addr0. I can use this info to peek into the values of function
 local variables pushed onto the stack (specifically the function's
 stack frame).

Exactly -- you can use "bt -f" or "bt -F" to do just that, where -f
just dumps the raw stack frame data, whereas -F also translates the
stack contents into known variable names/offsets, or into the slab cache
that it came from if either case is applicable.

For example:

  crash> bt
  ...
  #12 [ffff880037cb9ef0] vfs_write at ffffffff81172718
  #13 [ffff880037cb9f30] sys_write at ffffffff81173151
  ...

  crash> bt -f
  ...
  #12 [ffff880037cb9ef0] vfs_write at ffffffff81172718
      ffff880037cb9ef8: ffff880037cb9f78 ffffffff810d1b62 
      ffff880037cb9f08: ffff880078056260 ffff8800781248c0 
      ffff880037cb9f18: 00007f9b6f177000 0000000000000002 
      ffff880037cb9f28: ffff880037cb9f78 ffffffff81173151 
  #13 [ffff880037cb9f30] sys_write at ffffffff81173151
  ...

  crash> bt -F
  ...
  #12 [ffff880037cb9ef0] vfs_write at ffffffff81172718
      ffff880037cb9ef8: ffff880037cb9f78 audit_syscall_entry+626 
      ffff880037cb9f08: [size-1024]      [filp]           
      ffff880037cb9f18: 00007f9b6f177000 0000000000000002 
      ffff880037cb9f28: ffff880037cb9f78 sys_write+81     
  #13 [ffff880037cb9f30] sys_write at ffffffff81173151
  ...
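
The extent of the frame #12 dump can be checked with a little arithmetic on the two bracketed addresses. This is just a sketch based on the "bt -f" output shown above, assuming 8-byte stack words; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumption: x86_64, so each stack slot is 8 bytes wide. */
#define STACK_WORD_BYTES UINT64_C(8)

/* Address of the first word that "bt -f" prints for a frame. */
uint64_t first_dumped_word(uint64_t frame_addr)
{
    return frame_addr + STACK_WORD_BYTES;
}

/*
 * Number of words printed between one frame's bracketed address and the
 * next: the dump starts one word above frame_addr and ends with the
 * word located at next_frame_addr itself.
 */
uint64_t dumped_words(uint64_t frame_addr, uint64_t next_frame_addr)
{
    return (next_frame_addr - frame_addr) / STACK_WORD_BYTES;
}
```

For frame #12, first_dumped_word(0xffff880037cb9ef0) gives ffff880037cb9ef8 and dumped_words(0xffff880037cb9ef0, 0xffff880037cb9f30) gives 8, which is the four lines of two words each in the dump. The last of those words, at ffff880037cb9f30, holds ffffffff81173151, the same text address shown on the #13 sys_write line.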

Oftentimes the [slab-cache] or symbol+offset references can
help pinpoint a local variable.
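
The symbol+offset form that "bt -F" prints relates to the raw value from "bt -f" by simple subtraction, with the offset printed in decimal. A minimal sketch, using a made-up helper name:

```c
#include <stdint.h>

/*
 * Given a raw text address from "bt -f" and the decimal offset that
 * "bt -F" prints after the symbol name, return the address at which
 * the symbol starts.
 */
uint64_t symbol_start(uint64_t raw_addr, uint64_t decimal_offset)
{
    return raw_addr - decimal_offset;
}
```

For the two translated entries in frame #12, symbol_start(0xffffffff81173151, 81) yields ffffffff81173100 for sys_write, and symbol_start(0xffffffff810d1b62, 626) yields ffffffff810d18f0 for audit_syscall_entry.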

Dave
