----- Original Message -----
Hello,
I am using crash version: 6.0.4-2.el6 on CentOS 6.3 (kernel
2.6.32-279.el6.x86_64). I apologize for my newbie questions, but
googling did not help much.
When analyzing a kernel dump, I am getting the following bt.
crash> bt
PID: 12663 TASK: ffff88036304f500 CPU: 0 COMMAND: "bash"
#0 [ffff88035b949570] machine_kexec at ffffffff8103281b
#1 [ffff88035b9495d0] crash_kexec at ffffffff810ba662
#2 [ffff88035b9496a0] oops_end at ffffffff81501290
#3 [ffff88035b9496d0] no_context at ffffffff81043bab
#4 [ffff88035b949720] __bad_area_nosemaphore at ffffffff81043e35
#5 [ffff88035b949770] bad_area at ffffffff81043f5e
#6 [ffff88035b9497a0] __do_page_fault at ffffffff81044710
#7 [ffff88035b9498c0] do_page_fault at ffffffff8150326e
#8 [ffff88035b9498f0] page_fault at ffffffff81500625
[exception RIP: ahaann+47]
RIP: ffffffffa06ce48f RSP: ffff88035b9499a8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88035daef4e0
RBP: ffff88035b9499b8 R8: 0000000004a47daf R9: ffffffffa06dae99
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
R13: 00007fc82f4b8000 R14: 000000000000000a R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88035b9499c0] ahaecho at ffffffffa06d2899 [ahadrv]
#10 [ffff88035b949a00] writectl at ffffffffa06c366e [ahadrv]
#11 [ffff88035b949e40] writeaha at ffffffffa06d3e7b [ahadrv]
#12 [ffff88035b949e60] proc_file_write at ffffffff811e6e44
#13 [ffff88035b949ea0] proc_reg_write at ffffffff811e0abe
#14 [ffff88035b949ef0] vfs_write at ffffffff8117b068
#15 [ffff88035b949f30] sys_write at ffffffff8117ba81
#16 [ffff88035b949f80] system_call_fastpath at ffffffff8100b0f2
RIP: 0000003a29ada3c0 RSP: 00007ffffaec6830 RFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffffff8100b0f2 RCX: 0000000000000065
RDX: 000000000000000a RSI: 00007fc82f4b8000 RDI: 0000000000000001
RBP: 00007fc82f4b8000 R8: 000000000000000a R9: 00007fc82f4aa700
R10: 00000000fffffff7 R11: 0000000000000246 R12: 000000000000000a
R13: 0000003a29d8c780 R14: 000000000000000a R15: 0000000001e18460
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
crash>
1. Are the hex addr in [] right before the function name the stack
frame ptr for that function?
On x86_64 machines, the "at <address>" shown is the address in that
frame's function to which the call instruction it made will return. So for
example, taking frame #15, where "sys_write at ffffffff8117ba81" has called
vfs_write(), you can disassemble all instructions from the beginning of
sys_write() up to that return address, like this example (taken from a
different kernel, so the addresses differ from yours):
crash> dis -r ffffffff80016e6b
0xffffffff80016e26 <sys_write>: push %r13
0xffffffff80016e28 <sys_write+2>: mov %rsi,%r13
0xffffffff80016e2b <sys_write+5>: push %r12
0xffffffff80016e2d <sys_write+7>: mov $0xfffffffffffffff7,%r12
0xffffffff80016e34 <sys_write+14>: push %rbp
0xffffffff80016e35 <sys_write+15>: mov %rdx,%rbp
0xffffffff80016e38 <sys_write+18>: push %rbx
0xffffffff80016e39 <sys_write+19>: sub $0x18,%rsp
0xffffffff80016e3d <sys_write+23>: lea 0x14(%rsp),%rsi
0xffffffff80016e42 <sys_write+28>: callq 0xffffffff8000b5b4 <fget_light>
0xffffffff80016e47 <sys_write+33>: test %rax,%rax
0xffffffff80016e4a <sys_write+36>: mov %rax,%rbx
0xffffffff80016e4d <sys_write+39>: je 0xffffffff80016e86 <sys_write+96>
0xffffffff80016e4f <sys_write+41>: mov 0x38(%rax),%rax
0xffffffff80016e53 <sys_write+45>: lea 0x8(%rsp),%rcx
0xffffffff80016e58 <sys_write+50>: mov %rbp,%rdx
0xffffffff80016e5b <sys_write+53>: mov %r13,%rsi
0xffffffff80016e5e <sys_write+56>: mov %rbx,%rdi
0xffffffff80016e61 <sys_write+59>: mov %rax,0x8(%rsp)
0xffffffff80016e66 <sys_write+64>: callq 0xffffffff800164d0 <vfs_write>
0xffffffff80016e6b <sys_write+69>: mov %rax,%r12
crash>
And the stack address shown in [] for each frame is the location on the
stack that contains that return address.
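
To make the arithmetic concrete, here is a small Python sketch (not part of crash itself) that reproduces the return address and the "sys_write+69" offset from the sample listing. The addresses are copied from that listing; the 5-byte call length is read off the listing (the call sits at +64 and the next instruction at +69), and the helper names are made up for illustration.

```python
# Addresses copied from the sample "dis -r" listing above.
SYS_WRITE_START = 0xffffffff80016e26   # <sys_write>
CALL_INSN_ADDR = 0xffffffff80016e66    # <sys_write+64>: callq vfs_write
CALL_INSN_LEN = 5                      # assumption: read off the listing (+64 -> +69)


def return_address(call_addr, insn_len=CALL_INSN_LEN):
    """Address a call pushes on the stack: the instruction right after it."""
    return call_addr + insn_len


def symbol_offset(addr, func_start):
    """Offset of addr from the start of its function, as in <func+offset>."""
    return addr - func_start


ret = return_address(CALL_INSN_ADDR)
print(hex(ret))                                              # 0xffffffff80016e6b
print("sys_write+%d" % symbol_offset(ret, SYS_WRITE_START))  # sys_write+69
```

The same reasoning applies to your frame #15: ffffffff8117ba81 would be the address of the instruction following the call to vfs_write() in your kernel's sys_write().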
2. I am assuming the panic occurred in function ahaann() (and not in
ahaecho() ). Is that right?
That's correct. The exception occurred precisely when executing the
instruction shown here: [exception RIP: ahaann+47], which is at RIP
ffffffffa06ce48f. You can do a "dis -r ahaann+47" to see the instructions
leading up to the fatal one. If you load the ahadrv module with
"mod -s ahadrv", you can also get line numbers interspersed with
"dis -rl ahaann+47".
3. What is puzzling me is why there is no frame associated with call
to ahaann(). Or is frame #8 associated to ahaann(). From the display
it seems frame #8 is associated to page_fault() since 0xffffffff81500625
is an address in page_fault(). Or am I totally misinterpreting the call stack?
crash> dis ffffffff81500625
0xffffffff81500625 <page_fault+37>: jmpq 0xffffffff81500830
The ahaann() function didn't lay down a full frame because while it
was executing, it took a page fault exception. As soon as that
occurred, an exception frame was dumped onto the stack at that
point (the register dump). Control at that point was transferred
to page_fault() to handle the exception. Normally the exception
handler would quietly resolve the page fault and return to ahaann(),
and the function would continue on. But the address that caused
the page fault was bogus/unresolvable, so the handler never returned
and crashed the system instead.
So again, what you should do is:
crash> mod -s ahadrv (presuming you've got the kernel-debuginfo package installed)
...
crash> dis -rl ahaann+47
And look at the last instruction shown. My guess is that it's
dereferencing a NULL pointer (probably via one of the zero-filled
RAX, RBX, RCX, RDX or RSI registers; RDI holds what looks like a
valid kernel address).
4. I can understand the value of register dump for frame #8, due to
the panic. What is the significance of the register dump for frame
#16?
Whenever a program running in user-space enters the kernel, it does
so as the result of an exception, be it a system call, page fault,
interrupt, etc. And as with the in-kernel page fault exception, the
user's register set is laid down at the top of the kernel stack so
that it can be restored upon return to user-space.
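
As an illustration of what that saved user-space register set tells you, here is a Python sketch that decodes the frame #16 dump as a system call. It assumes the standard x86_64 Linux convention (system call number saved in ORIG_RAX, first three arguments in RDI, RSI and RDX, and the low two bits of CS giving the privilege level); the small syscall table and the helper names are illustrative only.

```python
# Values copied from the user-space register dump after frame #16 above.
USER_REGS = {
    "ORIG_RAX": 0x1,
    "RDI": 0x1,
    "RSI": 0x00007fc82f4b8000,
    "RDX": 0xa,
    "CS": 0x33,
}
KERNEL_CS = 0x10  # CS value from the in-kernel exception frame

# Tiny excerpt of the x86_64 syscall table; read and write share
# the (fd, buf, count) argument layout.
SYSCALL_NAMES = {0: "read", 1: "write"}


def privilege_level(cs):
    """Low two bits of CS: 0 means kernel mode, 3 means user mode."""
    return cs & 3


def describe_syscall(regs):
    """Render the saved registers as a read/write style call."""
    nr = regs["ORIG_RAX"]
    name = SYSCALL_NAMES.get(nr, "syscall_%d" % nr)
    return "%s(fd=%d, buf=0x%x, count=%d)" % (
        name, regs["RDI"], regs["RSI"], regs["RDX"])


print(describe_syscall(USER_REGS))       # write(fd=1, buf=0x7fc82f4b8000, count=10)
print(privilege_level(USER_REGS["CS"]))  # 3 (user mode)
print(privilege_level(KERNEL_CS))        # 0 (kernel mode)
```

Read this way, the dump is consistent with the rest of the backtrace: bash made a 10-byte write(), which went through sys_write() and vfs_write() into the ahadrv module.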
Dave