-----Original Message-----
From: crash-utility-bounces(a)redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Monday, June 28, 2010 1:35 PM
To: Discussion list for crash utility usage,maintenance and
development
Subject: Re: [Crash-utility] infinite loop in crash due to
double-NMI on x86_64 system
----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:
> > -----Original Message-----
> > From: crash-utility-bounces(a)redhat.com
> > [mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
> > Sent: Monday, June 28, 2010 12:11 PM
> > To: Discussion list for crash utility usage,maintenance and
> > development
> > Subject: Re: [Crash-utility] infinite loop in crash due to
> > double-NMI on x86_64 system
> >
> >
> >
> > ----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:
> >
> > > Below is the output of running crash (with the patch) against one of
> > > these dumps.
> > >
> > > -Lucas
> > >
> > >
> > > crash 5.0.5
> > > Copyright (C) 2002-2010 Red Hat, Inc.
> > > Copyright (C) 2004, 2005, 2006 IBM Corporation
> > > Copyright (C) 1999-2006 Hewlett-Packard Co
> > > Copyright (C) 2005, 2006 Fujitsu Limited
> > > Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
> > > Copyright (C) 2005 NEC Corporation
> > > Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
> > > Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
> > > This program is free software, covered by the GNU General Public License,
> > > and you are welcome to change it and/or distribute copies of it under
> > > certain conditions. Enter "help copying" to see the conditions.
> > > This program has absolutely no warranty. Enter "help warranty" for
> > > details.
> > >
> > > GNU gdb (GDB) 7.0
> > > Copyright (C) 2009 Free Software Foundation, Inc.
> > > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > > This is free software: you are free to change and redistribute it.
> > > There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> > > and "show warranty" for details.
> > >
> > > This GDB was configured as "x86_64-unknown-linux-gnu"...
> > >
> > > please wait... (determining panic task)
> > >
> > > WARNING: Loop detected in the NMI Exception Stack!
> > >
> > > bt: cannot transition from exception stack to current process stack:
> > > exception stack pointer: ffffffff8046dc50
> > > process stack pointer: ffffffff8046ddd8
> > > current stack base: ffffffff80422000
> > >
> > > SYSTEM MAP: /boot/System.map-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > DEBUG KERNEL: /boot/vmlinux-2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > (2.6.16.53-0.8.PTF.434477.9.TDC.0-smp)
> > > DUMPFILE: /var/crash/lucas.save/vmcore [PARTIAL DUMP]
> > > CPUS: 4
> > > DATE: Tue May 18 12:46:07 2010
> > > UPTIME: 07:24:54
> > > LOAD AVERAGE: 85.74, 82.85, 82.29
> > > TASKS: 2449
> > > NODENAME: POLO5_1-9
> > > RELEASE: 2.6.16.53-0.8.PTF.434477.9.TDC.0-smp
> > > VERSION: #1 SMP Fri Aug 31 06:07:27 PDT 2007
> > > MACHINE: x86_64 (2660 Mhz)
> > > MEMORY: 7.9 GB
> > > PANIC: "Kernel panic - not syncing: dumpsw: Dump switch pushed; reason: 0x20 args=0xffffffff8046df08"
> > > PID: 0
> > > COMMAND: "swapper"
> > > TASK: ffffffff8038c340 (1 of 4) [THREAD_INFO: ffffffff80422000]
> > > CPU: 0
> > > STATE: TASK_RUNNING (PANIC)
> > >
> > > crash> bt
> > > PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> > > #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> > > #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> > > #2 [ffffffff8046dde0] panic at ffffffff801327fa
> > > #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> > > #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> > > #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> > > #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> > > #7 [ffffffff8046df50] nmi at ffffffff8032268f
> > > [exception RIP: smp_send_stop+84]
> > > RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> > > RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> > > RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> > > RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> > > R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> > > R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > > --- <NMI exception stack> ---
> > > #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
> > > bt: WARNING: Loop detected in the NMI Exception Stack!
> > > bt: cannot transition from exception stack to current process stack:
> > > exception stack pointer: ffffffff8046dc50
> > > process stack pointer: ffffffff8046ddd8
> > > current stack base: ffffffff80422000
> > > crash>
> >
> > What exactly was the sequence of events? Was the system repeatedly and
> > erroneously running one NMI after another for some reason, and *then* the
> > "dump switch" was pressed? And the dumpsw_notify() function sends another
> > NMI? And where does that dumpsw_notify() function live anyway?
> >
> > I'm just trying to get a grip on whether this will ever happen again, or
> > whether it's fixing a one-time hardware abnormality?
> >
> > Dave
> >
>
> As far as I am aware, we have had three separate customers encounter
> this issue. It appears from the hardware SEL log that multiple PCI
> SERR's came in at the same time and somehow triggered multiple NMIs.
> You can see the SEL entries from the output of the "ipmitool sel"
> command:
>
> 0231 11FC 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 15 08
>   Crit. Interrupt PCI SERR (PCI Bus 15 Device 1 Function 0) was asserted
> 0232 1210 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 20
>   Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 0) was asserted
> 0233 1224 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 21
>   Crit. Interrupt PCI SERR (PCI Bus 16 Device 4 Function 1) was asserted
> 0234 1238 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 30
>   Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 0) was asserted
> 0235 124C 02 01:53:47 12/17/09 3300 04 13 EB 6F A5 16 31
>   Crit. Interrupt PCI SERR (PCI Bus 16 Device 6 Function 1) was asserted
>
> My understanding of the architecture of the system is that only one NMI
> should have been asserted to the OS regardless of the number of times
> there was a hardware error, but clearly that wasn't the case in these
> three instances.
>
> Also, it seemed like my patch made crash a little bit more tolerant of
> "corrupted" dump images which I thought could only be a good thing.
Right, I understand that...
But you didn't answer my questions re: the "dump switch" procedure and
the dumpsw_notify() function. Was the system stuck in the NMI handler,
somebody noticed the repetitive NMIs (?), and so they hit the "dump switch"?
(whatever that may be...)
Dave
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility
Sorry, guess I wasn't clear. Nobody hit the dump switch on these
systems. They simply had multiple hardware errors that apparently
triggered the NMI more than once. That's what I was trying to show with
the SEL records, that the multiple NMIs were straight from hardware with
no human intervention.
The systems went through a panic (due to the multiple NMIs), a reboot, and
then crash was run on the resulting dump. In fact, crash was run
automatically from a startup script, and there was no human intervention
until someone noticed that crash was filling up the root file system with
a temporary file because of the infinite loop.
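To illustrate the failure mode, here is a minimal C sketch of the kind of
check that stops such a loop. It is not the patch discussed in this thread
and not crash's actual code: the struct, the function names, and the 4 KB
NMI stack based at ffffffff8046d000 are assumptions made for the example.
The idea is that when bt leaves an exception stack, the RSP saved in the
exception frame should point outside that stack. In the bt output above it
points back into the NMI stack (ffffffff8046ddd8), so following it would
apparently walk the same stack again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of one exception stack; not taken from the dump. */
#define NMI_STACK_SIZE 4096ULL

/* Hypothetical description of one exception stack: [base, base + size). */
struct estack {
    uint64_t base;
    uint64_t size;
};

/* True if addr lies inside the half-open range covered by the stack. */
static bool addr_in_estack(const struct estack *es, uint64_t addr)
{
    assert(es->size > 0);
    return addr >= es->base && addr - es->base < es->size;
}

/*
 * True if the RSP saved in an exception frame points back into the
 * exception stack that is being left. A backtracer that followed such
 * a pointer would re-enter the same stack instead of reaching the
 * process stack, so it should warn and stop instead.
 */
static bool estack_transition_loops(const struct estack *es,
                                    uint64_t saved_rsp)
{
    return addr_in_estack(es, saved_rsp);
}
```

With an assumed NMI stack of { 0xffffffff8046d000, NMI_STACK_SIZE },
estack_transition_loops() returns true for the saved RSP ffffffff8046ddd8
from the dump and false for an address on the swapper stack that starts at
ffffffff80422000.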
-Lucas