Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system

Monday, 28 June 2010

----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:

...
 Below is the output of running crash (with the patch) against one of
 these dumps.

 -Lucas

 crash 5.0.5
 Copyright (C) 2002-2010  Red Hat, Inc.
 Copyright (C) 2004, 2005, 2006  IBM Corporation
 Copyright (C) 1999-2006  Hewlett-Packard Co    
 Copyright (C) 2005, 2006  Fujitsu Limited      
 Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
 Copyright (C) 2005  NEC Corporation                  
 Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
 Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
 This program is free software, covered by the GNU General Public License,
 and you are welcome to change it and/or distribute copies of it under 
 certain conditions.  Enter "help copying" to see the conditions.

 This program has absolutely no warranty.  Enter "help warranty" for
 details.

 GNU gdb (GDB) 7.0
 Copyright (C) 2009 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later
 <http://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type "show copying"   
 and "show warranty" for details.

 This GDB was configured as "x86_64-unknown-linux-gnu"...

 please wait... (determining panic task)                               

 WARNING: Loop detected in the NMI Exception Stack!                    

 bt: cannot transition from exception stack to current process stack:
     exception stack pointer: ffffffff8046dc50                       
       process stack pointer: ffffffff8046ddd8
          current stack base: ffffffff80422000

   SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
 DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
 (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
     DUMPFILE: /var/crash/lucas.save/vmcore  [PARTIAL DUMP]
         CPUS: 4
         DATE: Tue May 18 12:46:07 2010
       UPTIME: 07:24:54
 LOAD AVERAGE: 85.74, 82.85, 82.29
        TASKS: 2449
     NODENAME: POLO5_1-9
      RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
      VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
      MACHINE: x86_64  (2660 Mhz)
       MEMORY: 7.9 GB
        PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20 args=0xffffffff8046df08"
          PID: 0
      COMMAND: "swapper"
         TASK: ffffffff8038c340  (1 of 4)  [THREAD_INFO: ffffffff80422000]
          CPU: 0
        STATE: TASK_RUNNING (PANIC)

 crash> bt
 PID: 0      TASK: ffffffff8038c340  CPU: 0   COMMAND: "swapper"
  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
  #2 [ffffffff8046dde0] panic at ffffffff801327fa
  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
  #7 [ffffffff8046df50] nmi at ffffffff8032268f
     [exception RIP: smp_send_stop+84]
     RIP: ffffffff80116e44  RSP: ffffffff8046ddd8  RFLAGS: 00000246
     RAX: 00000000000000ff  RBX: ffffffff8831c1f8  RCX: 000041049c7256e8
     RDX: 0000000000000005  RSI: 000000005238a938  RDI: 00000000002896a0
     RBP: ffffffff8046df08   R8: 00000000000040fb   R9: 000000005238a7e8
     R10: 0000000000000002  R11: 0000ffff0000ffff  R12: 000000000000000c
     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 --- <NMI exception stack> ---
  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
 bt: WARNING: Loop detected in the NMI Exception Stack!
 bt: cannot transition from exception stack to current process stack:
     exception stack pointer: ffffffff8046dc50
       process stack pointer: ffffffff8046ddd8
          current stack base: ffffffff80422000
 crash>

What exactly was the sequence of events?  Was the system repeatedly and
erroneously running one NMI after another for some reason, and *then* the
"dump switch" was pressed?  And the dumpsw_notify() function sends another
NMI?  And where does that dumpsw_notify() function live anyway?

I'm just trying to get a grip on whether this will ever happen again, or
whether the patch is fixing a one-time hardware abnormality.

Dave

...
 -----Original Message-----
 From: crash-utility-bounces(a)redhat.com
 [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
 Sent: Friday, June 25, 2010 12:32 PM
 To: Discussion list for crash utility usage, maintenance and development
 Subject: Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system

 ----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:

 > Hi,
 >
 > I've run into an issue where crash will enter an infinite loop while
 > decoding exception stacks if those stacks get corrupted.
 >
 > We've seen this on four different systems where the hardware generated
 > multiple NMIs and the second and subsequent NMIs caused the NMI
 > exception stack to be overwritten. When this condition is hit, the
 > bottom rsp on the NMI exception stack (which would normally point you
 > back to the kernel thread stack or possibly a different exception stack)
 > points you back into the middle of the same NMI exception stack. This
 > causes crash to infinitely loop when it tries to decode that exception
 > stack.
 >
 > Now clearly the root cause of the issue is faulty hardware that
 > generated multiple NMIs. However a very small change in crash can detect
 > this issue and stop the infinite loop from happening thereby allowing
 > you to get to a point in crash where you can actually tell that it was
 > an NMI that caused the system to dump.
 >
 > The patch is attached to this email. For x86_64 it will detect the
 > condition of any exception stack that points back at itself.
 >
 > Please feel free to ask me any questions on this.

 Wow, that's pretty interesting -- I've certainly never seen that before.
 Can you show me what the backtrace looks like with your patch applied?

 Thanks,
   Dave

 --
 Crash-utility mailing list
 Crash-utility(a)redhat.com
 https://www.redhat.com/mailman/listinfo/crash-utility
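
The condition described in this thread (the rsp saved at the bottom of an
exception stack pointing back into that same exception stack) can be
illustrated with a small range check. This is only a sketch and is not taken
from the attached patch or from the crash sources: the helper name
rsp_in_stack() is hypothetical, and the stack bounds used in the note after
the code are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from crash or the patch): returns true when
 * rsp lies inside the stack occupying [stack_base, stack_base + stack_size).
 * The subtraction form avoids overflow when the stack sits near the top
 * of the 64-bit address space.
 */
static bool rsp_in_stack(uint64_t rsp, uint64_t stack_base, uint64_t stack_size)
{
    return rsp >= stack_base && (rsp - stack_base) < stack_size;
}
```

With the values from the bt output above, and assuming a 4 KB NMI exception
stack covering ffffffff8046d000 up to ffffffff8046e000, the saved RSP
ffffffff8046ddd8 falls inside the NMI stack itself, which is the loop being
reported. Assuming an 8 KB process stack at base ffffffff80422000, the same
RSP falls outside it, consistent with the "cannot transition from exception
stack to current process stack" message.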

