Re: [Crash-utility] [PATCH] Add -r option to timer

Wednesday, 27 February 2013

----- Original Message -----
...
 Hello Dave,

 Changed according to your suggestions. But need to add some explanation
 here.

 change 1:
 > Perhaps there could be a way to pre-verify the addresses with
 > accessible(), and if the address is bogus, display an error message,
 > but allow the command to continue on with the other cpus?

 Error comes from rb_last() which used to calculate the space for
 printing expires.

 I fixed rb_last(), adding pre-verify the addresses. But I didn't use
 rb_last() any more, for change 2 helps to find the last hrtimer.

 change 2:
 >
 > Maybe you can use hq_open()/hq_enter()/hq_close() on the hrtimer addresses
 > to prevent this from happening, warn the user when it does, and continue on
 > with the next cpu?

 As you suggested.

 change3:
 When doing some snap test, I found rb_node's data may be destroyed, so
 I also fixed other rbtree manipulation functions to avoid abort of
 command.

Hi Qiao,

Thanks -- this second version makes it through my set of test kernels 
successfully.  The only vmcore I have that needs examination is this one,
where a softlockup timeout occurred:

 crash> bt
 PID: 7589   TASK: ffff8802292fa040  CPU: 0   COMMAND: "crash"
  #0 [ffff880028203ce0] machine_kexec at ffffffff810329db
  #1 [ffff880028203d40] crash_kexec at ffffffff810bdbd2
  #2 [ffff880028203e10] panic at ffffffff815097a9
  #3 [ffff880028203e90] watchdog_timer_fn at ffffffff810de51d
  #4 [ffff880028203ec0] __run_hrtimer at ffffffff810986ee
  #5 [ffff880028203f10] hrtimer_interrupt at ffffffff81098a56
  #6 [ffff880028203f90] smp_apic_timer_interrupt at ffffffff8151350b
  #7 [ffff880028203fb0] apic_timer_interrupt at ffffffff8100bc13
 --- <IRQ stack> ---
  #8 [ffff8802292a1df8] apic_timer_interrupt at ffffffff8100bc13
     [exception RIP: copy_user_generic_string+45]
     RIP: ffffffff81283ddd  RSP: ffff8802292a1ea0  RFLAGS: 00010246
     RAX: ffff8802292a0000  RBX: ffff8802292a1ee8  RCX: 0000000000000200
     RDX: 0000000000000000  RSI: ffff880020000000  RDI: 0000000000d91dc0
     RBP: ffffffff8100bc0e   R8: ffffffff81346e70   R9: 0000000000000001
     R10: 0000000000105c52  R11: 0000000000000246  R12: 0000000000d91dc0
     R13: 0000000000020000  R14: 0000000000000000  R15: 0000000020000000
     ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
  #9 [ffff8802292a1ea0] crash_read at ffffffff81346fbd
 #10 [ffff8802292a1ef0] vfs_read at ffffffff8117fd65
 #11 [ffff8802292a1f30] sys_read at ffffffff8117fea1
 #12 [ffff8802292a1f80] system_call_fastpath at ffffffff8100b0f2
     RIP: 00007f3a6cb7e5f0  RSP: 00007fff740b8070  RFLAGS: 00000246
     RAX: 0000000000000000  RBX: ffffffff8100b0f2  RCX: 00007f3a6cb7e5f0
     RDX: 0000000000001000  RSI: 0000000000d91dc0  RDI: 0000000000000004
     RBP: 0000000000000004   R8: 0000000020000000   R9: 0000000020000000
     R10: 0000000000105c52  R11: 0000000000000246  R12: 0000000000001000
     R13: 0000000000d91dc0  R14: 0000000000001000  R15: 0000000000d91dc0
     ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
 crash>

And the timer command fails like so, again, without being able to
complete the rest of the cpus:

 crash> timer -r
 UPTIME: 300039(1000HZ)

 cpu: 0
  clock: 0
   .base:        ffff8800282115a8
   .offset:      1354043557752205725
   .get_time:    ktime_get_real
  EXPIRES      HRTIMER           FUNCTION    
 (empty)

  clock: 1
   .base:        ffff8800282115e8
   .offset:      0
   .get_time:    ktime_get
            EXPIRES                 HRTIMER           FUNCTION    
   300040000000-300040000000    ffff8800282116a0  ffffffff810a4b70 
<tick_sched_timer>
  3660368239466-3660368239466   ffff880224f07c68  ffffffff81071c00  <it_real_fn>
  3660700491472-3660700491472   ffff880224f07068  ffffffff81071c00  <it_real_fn>
 14461470907794-14461570907794  ffff88022750fa68  ffffffff81098160 
<hrtimer_wakeup>

  clock: 2
   .base:        ffff880028211628
   .offset:      0
   .get_time:    
 timer: invalid kernel virtual address: 2c  type: "ktime_t tv64"
 crash> 

Presumably the kernel was in the midst of tinkering with the hrtimer-related
structures, and that was what causes the bad readmem() above.  But I'm 
thinking that since softlockups are a fairly common cause of kernel crashes,
that this option should be able to handle it better than above?  

I've put copies of the vmlinux/vmcore pair above here for you to take a look:

  http://people.redhat.com/anderson/qiao

The next thing that needs to be reworked is the display itself.
I really don't like the ".base", ".offset" and
".get_time" lines,
primarily because none of the other crash commands do anything like
that.  Also, the UPTIME display is kind of strange.  It only applies to 
the "ktime_get" type clocks, right? 

I understand why you selected those particular fields -- although I can't
find a ".base" member in either the hr_clock_base or hrtimer_cpu_base
structures?  But more importantly, what if a user wants to see the other
fields in either of those two data structures?  There's no obvious
way to find their addresses. 

Anyway, for consistency with other crash commands, I think that it's
more useful to display the hrtimer_cpu_base and hr_clock_base structure
addresses, and perhaps the most important fields within them, in tabular
form, grouped together something like:

 crash> timer -r
 CPU: 0
  CLOCK: 0
  HRTIMER_CPU_BASE  HR_CLOCK_BASE   OFFSET     FUNCTION
    <address>         <address>     <value>    ktime_get
            EXPIRES                 HRTIMER           FUNCTION    
   396384243603-396384293603    ffff88003c905eb8  ffffffff8107da00 
<hrtimer_wakeup>
   397000645784-397004641782    ffff88003d2b1a58  ffffffff8107da00 
<hrtimer_wakeup>
   407501746988-407531746986    ffff88003c2b5a58  ffffffff8107da00 
<hrtimer_wakeup>
   446387171280-446446754509    ffff88003bb43908  ffffffff8107da00 
<hrtimer_wakeup>
   900000001418-900000001418    ffff8800368e53c0  ffffffff811c57d0 
<timerfd_tmrproc>

  CLOCK: 1
  HRTIMER_CPU_BASE  HR_CLOCK_BASE   OFFSET     FUNCTION
    <address>         <address>     <value>    ktime_get_real
  EXPIRES      HRTIMER           FUNCTION    
  (empty)

  CLOCK: 2
  HRTIMER_CPU_BASE  HR_CLOCK_BASE   OFFSET     FUNCTION
    <address>         <address>     <value>    ktime_get
           EXPIRES                HRTIMER           FUNCTION    
  396366000000-396366000000   ffff88003fcce6a0  ffffffff810ac4a0 
<tick_sched_timer>
  394748203985-394749203930   ffff880039f4f908  ffffffff8107da00  <hrtimer_wakeup>
  400056995196-400056995196   ffff88003fcce820  ffffffff810e2800 
<watchdog_timer_fn>
 3616558364176-3616558414176  ffff88003cf6beb8  ffffffff8107da00  <hrtimer_wakeup>

 CPU: 1
  ...

I think a display like the above shows everything that you are doing
now -- plus a bit more.

What do you think?

Dave

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [Crash-utility] [PATCH] Add -r option to timer