Re: [Crash-utility] crash does not get proper backtrace?

Tuesday, 7 September 2010

----- hutao(a)cn.fujitsu.com wrote:

...
 On Thu, Sep 02, 2010 at 08:55:38AM -0400, Dave Anderson wrote:
 > 
 > ----- hutao(a)cn.fujitsu.com wrote:
 > 
 > > On Thu, Sep 02, 2010 at 03:46:00PM +0800, hutao(a)cn.fujitsu.com
 wrote:
 > > > Hi,
 > > > 
 > > > I got a problem where it seemed crash got a bad backtrace.
 > > > The problem occurred under the following conditions:
 > > > On a qemu guest system loading a module that stuck at
 > > > the init function(say, call a function that did deadlooping),
 > > > then dumped the guest by `virsh dump vm dumpfile', and run
 > > > crash on the dumpfile.
 > > > 
 > > > The module is:
 > > > 
 > > > ---
 > > > #include <linux/module.h>
 > > > 
 > > > int endless_loop(void)
 > > > {
 > > > 	printk("endless loop\n");
 > > > 	while (1);
 > > > 
 > > > 	return 0;
 > > > }
 > > > 
 > > > int __init endless_init(void)
 > > > {
 > > > 	endless_loop();
 > > > 
 > > > 	return 0;
 > > > }
 > > > module_init(endless_init);
 > > > 
 > > > MODULE_LICENSE("GPL");
 > > > ---
 > > > 
 > > > crash bt command got:
 > > > 
 > > > crash> bt -a
 > > > PID: 0      TASK: ffffffff81648020  CPU: 0   COMMAND: "swapper"
 > > >  #0 [ffffffff81601e08] schedule at ffffffff813e8a49
 > > >  #1 [ffffffff81601e18] apic_timer_interrupt at ffffffff8100344e
 > > >  #2 [ffffffff81601ea0] need_resched at ffffffff8100970c
 > > >  #3 [ffffffff81601eb0] default_idle at ffffffff81009f6b
 > > >  #4 [ffffffff81601ec0] cpu_idle at ffffffff81001bf5
 > > > 
 > > > PID: 1088   TASK: ffff88001dda2d60  CPU: 1   COMMAND: "insmod"
 > > >  #0 [ffff88001e751dc8] schedule at ffffffff813e8a49
 > > >  #1 [ffff88001e751dd0] schedule at ffffffff813e8aec
 > > >  #2 [ffff88001e751e80] preempt_schedule_irq at ffffffff813e8c90
 > > >  #3 [ffff88001e751e90] retint_kernel at ffffffff813eab86
 > > >  #4 [ffff88001e751f20] do_one_initcall at ffffffff81000210
 > > >  #5 [ffff88001e751f50] sys_init_module at ffffffff8106b7ca
 > > >  #6 [ffff88001e751f80] system_call_fastpath at ffffffff81002a82
 > > >     RIP: 00007f761bb58b7a  RSP: 00007fff67a43120  RFLAGS: 00010206
 > > >     RAX: 00000000000000af  RBX: ffffffff81002a82  RCX: 0000000000020010
 > > >     RDX: 0000000000b96010  RSI: 00000000000163da  RDI: 0000000000b96030
 > > >     RBP: 0000000000b96010   R8: 0000000000010011   R9: 0000000000080000
 > > >     R10: 00007f761bb4b140  R11: 0000000000000202  R12: 00000000000163da
 > > >     R13: 00007fff67a44985  R14: 00000000000163da  R15: 0000000000b96010
 > > >     ORIG_RAX: 00000000000000af  CS: 0033  SS: 002b
 > > > 
 > > > Does it lose some function calls between do_one_initcall and
retint_kernel?
 > > > (endless_loop <- endless_init)
 > > > 
 > > 
 > > In addition, if we don't stick in the init function (there is still a
deadloop
 > > somewhere in module but triggered by, say, reading a /proc file) then the
backtrace
 > > outputed by crash is correct.
 > 
 > When you say "correct", I presume that you see your module functions as
frames.
 > But if you also see the backtrace starting with "schedule", then it's
just luck
 > that the backtrace bumped into your module functions.  It just so happened that
 > when walking back from schedule(), it "mistakenly" stumbled upon your
module's
 > functions.  
 > 
 > In the example above, I presume that when trying to backtrace from retint_kernel(),
 > it stepped over your module's "loop" functions that were called via
do_one_initcall().
 > That's why I suggest that you should probably see them on the kernel stack in 
 > between ffff88001e751e90 and ffff88001e751f20 if you use "bt -t".  That is
what
 > "bt -t" is for -- the "bt" command is never guaranteed to be
correct.

 Nothing new in between ffff88001e751e90 and ffff88001e751f20 with `bt -at':

 crash> bt -at
 PID: 0      TASK: ffffffff81648020  CPU: 0   COMMAND: "swapper"
               START: schedule at ffffffff813e8a49
   [ffffffff81601e18] apic_timer_interrupt at ffffffff8100344e
   [ffffffff81601e60] __atomic_notifier_call_chain at ffffffff813ed799
   [ffffffff81601ea0] need_resched at ffffffff8100970c
   [ffffffff81601eb0] default_idle at ffffffff81009f6b
   [ffffffff81601ec0] cpu_idle at ffffffff81001bf5
   [ffffffff81601f10] rest_init at ffffffff813d72ec
   [ffffffff81601f30] start_kernel at ffffffff816e1d77
   [ffffffff81601f40] command_line at ffffffff81718e90
   [ffffffff81601f70] x86_64_start_reservations at ffffffff816e12ac
   [ffffffff81601f90] x86_64_start_kernel at ffffffff816e13a8

 PID: 1088   TASK: ffff88001dda2d60  CPU: 1   COMMAND: "insmod"
               START: schedule at ffffffff813e8a49
   [ffff88001e751dd0] schedule at ffffffff813e8aec
   [ffff88001e751e40] get_parent_ip at ffffffff8102f193
   [ffff88001e751e60] sub_preempt_count at ffffffff813ed62c
   [ffff88001e751e70] need_resched at ffffffff8102a975
   [ffff88001e751e80] preempt_schedule_irq at ffffffff813e8c90
   [ffff88001e751e90] retint_kernel at ffffffff813eab86
   [ffff88001e751f20] do_one_initcall at ffffffff81000210
   [ffff88001e751f50] sys_init_module at ffffffff8106b7ca
   [ffff88001e751f80] system_call_fastpath at ffffffff81002a82
     RIP: 00007f761bb58b7a  RSP: 00007fff67a43120  RFLAGS: 00010206
     RAX: 00000000000000af  RBX: ffffffff81002a82  RCX: 0000000000020010
     RDX: 0000000000b96010  RSI: 00000000000163da  RDI: 0000000000b96030
     RBP: 0000000000b96010   R8: 0000000000010011   R9: 0000000000080000
     R10: 00007f761bb4b140  R11: 0000000000000202  R12: 00000000000163da
     R13: 00007fff67a44985  R14: 00000000000163da  R15: 0000000000b96010
     ORIG_RAX: 00000000000000af  CS: 0033  SS: 002b
 crash> 

 This is the related part of output of `bt -ar' in case you're
 interested:

 ffff88001e751e60:  sub_preempt_count+146 ffff88001e751e78 
 ffff88001e751e70:  need_resched+30  ffff88001e751e88 
 ffff88001e751e80:  preempt_schedule_irq+82 ffff88001e751f18 
 ffff88001e751e90:  retint_kernel+38 ffff88001e751dd8 
 ffff88001e751ea0:  0000000000000000 0000000000000004 
 ffff88001e751eb0:  0000000000000000 ffff88001e751fd8 
 ffff88001e751ec0:  0000000000000000 0000000000000001 
 ffff88001e751ed0:  0000000000000000 ffffffffa00f9000 
 ffff88001e751ee0:  ffffffffffffff10 ffffffffa00f9004 
 ffff88001e751ef0:  0000000000000010 0000000000000246 
 ffff88001e751f00:  ffff88001e751f18 0000000000000018 
 ffff88001e751f10:  ffff8800ffffffff ffff88001e751f48 
 ffff88001e751f20:  do_one_initcall+122 __this_module    
 ffff88001e751f30:  0000000000000000 0000000000020000 
 ffff88001e751f40:  0000000000b96030 ffff88001e751f78 
 ffff88001e751f50:  sys_init_module+196 0000000000b96010 
 ffff88001e751f60:  00000000000163da 00007fff67a44985 
 ffff88001e751f70:  00000000000163da 0000000000b96010 
 ffff88001e751f80:  system_call_fastpath+22 0000000000000202 
 ffff88001e751f90:  00007f761bb4b140 0000000000080000 
 ffff88001e751fa0:  0000000000010011 00000000000000af 
 ffff88001e751fb0:  0000000000020010 0000000000b96010 
 ffff88001e751fc0:  00000000000163da 0000000000b96030 
 ffff88001e751fd0:  00000000000000af 00007f761bb58b7a 
 ffff88001e751fe0:  0000000000000033 0000000000010206 
 ffff88001e751ff0:  00007fff67a43120 000000000000002b 

 and `sym -m endless':

 crash> sym -m endless
 ffffffffa00f6000 MODULE START: endless
 ffffffffa00f6000 (?) endless_loop  
 ffffffffa00f6030 (r) __ksymtab_endless_loop  
 ffffffffa00f6040 (r) __kstrtab_endless_loop  
 ffffffffa00f6050 (d) __this_module  
 ffffffffa00f6345 MODULE END: endless
 crash>  
Well in that case, there is no evidence left on the kernel stack
by the endless_loop() function, i.e., the return address back
to endless_loop() from from its call to printk().

Dave

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005