Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system

Monday, 28 June 2010

...
 -----Original Message-----
 From: crash-utility-bounces(a)redhat.com 
 [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
 Sent: Monday, June 28, 2010 1:35 PM
 To: Discussion list for crash utility usage,maintenance and 
 development
 Subject: Re: [Crash-utility] infinite loop in crash due to 
 double-NMI on x86_64 system

 ----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:

 > > -----Original Message-----
 > > From: crash-utility-bounces(a)redhat.com 
 > > [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave
 > Anderson
 > > Sent: Monday, June 28, 2010 12:11 PM
 > > To: Discussion list for crash utility usage,maintenance and 
 > > development
 > > Subject: Re: [Crash-utility] infinite loop in crash due to 
 > > double-NMI on x86_64 system
 > > 
 > > 
 > >   
 > > ----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:
 > > 
 > > > Below is the output of running crash (with the patch) against one of
 > > > these dumps.
 > > > 
 > > > -Lucas
 > > > 
 > > > 
 > > > crash 5.0.5
 > > > Copyright (C) 2002-2010  Red Hat, Inc.
 > > > Copyright (C) 2004, 2005, 2006  IBM Corporation
 > > > Copyright (C) 1999-2006  Hewlett-Packard Co    
 > > > Copyright (C) 2005, 2006  Fujitsu Limited      
 > > > Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
 > > > Copyright (C) 2005  NEC Corporation                  
 > > > Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
 > > > Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
 > > > This program is free software, covered by the GNU General Public License,
 > > > and you are welcome to change it and/or distribute copies of it under
 > > > certain conditions.  Enter "help copying" to see the conditions.
 > > > This program has absolutely no warranty.  Enter "help warranty" for
 > > > details.
 > > > 
 > > > GNU gdb (GDB) 7.0
 > > > Copyright (C) 2009 Free Software Foundation, Inc.
 > > > License GPLv3+: GNU GPL version 3 or later
 > > > <http://gnu.org/licenses/gpl.html>
 > > > This is free software: you are free to change and redistribute it.
 > > > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
 > > > and "show warranty" for details.
 > > > 
 > > > This GDB was configured as "x86_64-unknown-linux-gnu"...
 > > > 
 > > > please wait... (determining panic task)
 > > > 
 > > > WARNING: Loop detected in the NMI Exception Stack!
 > > > 
 > > > 
 > > > bt: cannot transition from exception stack to current process stack:
 > > >     exception stack pointer: ffffffff8046dc50
 > > >       process stack pointer: ffffffff8046ddd8
 > > >          current stack base: ffffffff80422000
 > > > 
 > > >   SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
 > > > DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
 > > >     DUMPFILE: /var/crash/lucas.save/vmcore  [PARTIAL DUMP]
 > > >         CPUS: 4
 > > >         DATE: Tue May 18 12:46:07 2010
 > > >       UPTIME: 07:24:54
 > > > LOAD AVERAGE: 85.74, 82.85, 82.29
 > > >        TASKS: 2449
 > > >     NODENAME: POLO5_1-9
 > > >      RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
 > > >      VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
 > > >      MACHINE: x86_64  (2660 Mhz)
 > > >       MEMORY: 7.9 GB
 > > >        PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20  args=0xffffffff8046df08"
 > > >          PID: 0
 > > >      COMMAND: "swapper"
 > > >         TASK: ffffffff8038c340  (1 of 4)  [THREAD_INFO: ffffffff80422000]
 > > >          CPU: 0
 > > >        STATE: TASK_RUNNING (PANIC)
 > > > 
 > > > crash> bt
 > > > PID: 0      TASK: ffffffff8038c340  CPU: 0   COMMAND: "swapper"
 > > >  #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
 > > >  #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
 > > >  #2 [ffffffff8046dde0] panic at ffffffff801327fa
 > > >  #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
 > > >  #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
 > > >  #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
 > > >  #6 [ffffffff8046df40] do_nmi at ffffffff80323365
 > > >  #7 [ffffffff8046df50] nmi at ffffffff8032268f
 > > >     [exception RIP: smp_send_stop+84]
 > > >     RIP: ffffffff80116e44  RSP: ffffffff8046ddd8  RFLAGS: 00000246
 > > >     RAX: 00000000000000ff  RBX: ffffffff8831c1f8  RCX: 000041049c7256e8
 > > >     RDX: 0000000000000005  RSI: 000000005238a938  RDI: 00000000002896a0
 > > >     RBP: ffffffff8046df08   R8: 00000000000040fb   R9: 000000005238a7e8
 > > >     R10: 0000000000000002  R11: 0000ffff0000ffff  R12: 000000000000000c
 > > >     R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
 > > >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 > > > --- <NMI exception stack> ---
 > > >  #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
 > > > bt: WARNING: Loop detected in the NMI Exception Stack!
 > > > bt: cannot transition from exception stack to current process stack:
 > > >     exception stack pointer: ffffffff8046dc50
 > > >       process stack pointer: ffffffff8046ddd8
 > > >          current stack base: ffffffff80422000
 > > > crash> 
 > >  
 > > What exactly was the sequence of events?  Was the system repeatedly and
 > > erroneously running one NMI after another for some reason, and *then* the
 > > "dump switch" was pressed?  And the dumpsw_notify() function sends another
 > > NMI?  And where does that dumpsw_notify() function live anyway?
 > > 
 > > I'm just trying to get a grip on whether this will ever happen again, or
 > > whether it's fixing a one-time hardware abnormality?
 > > 
 > > Dave
 > >
 > 
 > As far as I am aware, we have had three separate customers encounter
 > this issue. It appears from the hardware SEL log that multiple PCI
 > SERR's came in at the same time and somehow triggered multiple NMIs.
 > You can see the SEL entries from the output of the "ipmitool sel"
 > command:
 > 
 > 0231 11FC  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 15 08  Crit. Interrupt   PCI SERR (PCI Bus 15 Device 1 Function 0) was asserted
 > 0232 1210  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 20  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 4 Function 0) was asserted
 > 0233 1224  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 21  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 4 Function 1) was asserted
 > 0234 1238  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 30  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 6 Function 0) was asserted
 > 0235 124C  02  01:53:47 12/17/09  3300 04   13  EB   6F  A5 16 31  Crit. Interrupt   PCI SERR (PCI Bus 16 Device 6 Function 1) was asserted
 > 
 > My understanding of the architecture of the system is that only one NMI
 > should have been asserted to the OS regardless of the number of times
 > there was a hardware error, but clearly that wasn't the case in these
 > three instances.
 > 
 > Also, it seemed like my patch made crash a little bit more tolerant of
 > "corrupted" dump images which I thought could only be a good thing.

 Right, I understand that...

 But you didn't answer my questions re: the "dump switch" procedure and
 the dumpsw_notify() function.  Was the system stuck in the NMI handler,
 somebody noticed the repetitive NMIs (?), and so they hit the "dump switch"?
 (whatever that may be...)

 Dave

 --
 Crash-utility mailing list
 Crash-utility(a)redhat.com
 https://www.redhat.com/mailman/listinfo/crash-utility

Sorry, guess I wasn't clear. Nobody hit the dump switch on these
systems. They simply had multiple hardware errors that apparently
triggered the NMI more than once. That's what I was trying to show with
the SEL records, that the multiple NMIs were straight from hardware with
no human intervention.

The systems went through a panic (due to multiple NMIs), a reboot, and
then crash was run on the resulting dump. In fact crash was
automatically run via a startup script and there was no human
intervention until after it was noticed that crash was filling up the
root file system with a temporary file due to the infinite loop.
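
For illustration, the looping condition visible in the bt output above
can be sketched as below. This is only a minimal sketch and not the
patch itself: the stack sizes, the NMI stack base, and both function
names are assumptions made for the example, not values or code taken
from crash or from the dump. The idea is that the RSP saved in the
exception frame (ffffffff8046ddd8) still points into the NMI exception
stack rather than into the process stack that starts at
ffffffff80422000, so following it never leaves the exception stack.

```c
#include <stdint.h>

/* Assumed sizes for this kernel; neither value appears in the thread. */
#define EXCEPTION_STACK_SIZE 4096ULL   /* one NMI exception stack */
#define PROCESS_STACK_SIZE   8192ULL   /* one task (process) stack */

/* Return 1 if addr lies in [base, base + size), else 0. */
static int addr_in_stack(uint64_t base, uint64_t size, uint64_t addr)
{
    return addr >= base && addr - base < size;
}

/*
 * Hypothetical helper: decide whether a backtrace can leave the
 * exception stack.  Returns 0 when the saved RSP lands on the process
 * stack (the normal case), 1 when it still points into the exception
 * stack (the looping case), and -1 when it is on neither stack.
 */
static int check_stack_transition(uint64_t estack_base,
                                  uint64_t pstack_base,
                                  uint64_t saved_rsp)
{
    if (addr_in_stack(pstack_base, PROCESS_STACK_SIZE, saved_rsp))
        return 0;
    if (addr_in_stack(estack_base, EXCEPTION_STACK_SIZE, saved_rsp))
        return 1;
    return -1;
}
```

With the saved RSP and stack base from the output above, and an assumed
NMI stack base of ffffffff8046d000 (the frame addresses rounded down to
4 KB), check_stack_transition() returns 1, so a caller could stop
walking instead of re-entering the same stack.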

-Lucas
