My only guess is that something in the transition between the regular kernel and
the kdump kernel (somewhere in the kexec path) re-opens the door for a queued-up NMI
to come in just before the kdump kernel takes over. I've been digging through that
code, but so far haven't found anything that explains it.
-Lucas
-----Original Message-----
From: crash-utility-bounces(a)redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Tuesday, June 29, 2010 5:58 AM
To: Discussion list for crash utility usage, maintenance and development
Subject: Re: [Crash-utility] infinite loop in crash due to double-NMI on x86_64 system
----- "Petr Tesarik" <ptesarik(a)suse.cz> wrote:
> Silacci, Lucas wrote on Mon, 28 Jun 2010 at 17:26 -0400:
> > The dumpsw_notify function is part of a driver that was added to our
> > systems to trigger kernel panics when an NMI occurs. In the version of
> > the kernel we are using (SLES 10 SP1) this was necessary to cause an
> > actual panic to happen and a dump to be saved when an NMI occurred
> > (especially due to a dump switch being pressed, hence the name).
> >
> > That driver registers a callback (dumpsw_notify) into the die_chain and
> > calls panic() if the die code is a DIE_NMI.
>
> Hi,
>
> my opinion is that a NMI is ... well, a non-maskable interrupt. Which
> means there is nothing the kernel could possibly do to prevent the NMI
> handler itself from being interrupted by another NMI. Whatever the
> reason for it.
Really? According to the AMD x86_64 manual -- note the "Masking" section:

8.3.3 NMI-Non-Maskable-Interrupt Exception (Vector 2)

An NMI exception occurs as a result of system logic signalling a
non-maskable interrupt to the processor.

Error Code Returned: None.

Program Restart: NMI is an interrupt. The processor recognizes an NMI
at an instruction boundary. The saved instruction pointer points to the
instruction immediately following the boundary where the NMI was recognized.

Masking: NMI cannot be masked. However, when an NMI is executed by the
processor, recognition of subsequent NMIs are disabled until an IRET
instruction is executed.

And looking at the backtrace, I'm still having a hard time understanding how
it was possible. What am I missing?
Dave
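
For readers following the thread, here is a toy Python model of the "Masking" paragraph quoted above. It is only a sketch: the NmiLatch class, its method names, and the single "pending" slot are assumptions made for illustration and are not taken from the AMD manual or from the kernel. Under that reading, a second NMI raised while the handler is running stays queued, and it is delivered as soon as some IRET executes.

```python
class NmiLatch:
    """Toy model of NMI recognition on one CPU (hypothetical, illustration only)."""

    def __init__(self):
        self.masked = False   # True from NMI delivery until an IRET executes
        self.pending = False  # assumption: at most one NMI is held while masked
        self.delivered = 0    # how many times the NMI handler has been entered

    def _deliver(self):
        # Entering the handler disables recognition of further NMIs.
        self.masked = True
        self.delivered += 1

    def raise_nmi(self):
        # An NMI that arrives while masked is held, not dropped.
        if self.masked:
            self.pending = True
        else:
            self._deliver()

    def iret(self):
        # IRET re-enables recognition; a held NMI is then delivered at once.
        self.masked = False
        if self.pending:
            self.pending = False
            self._deliver()


cpu = NmiLatch()
cpu.raise_nmi()                 # first NMI: handler entered
cpu.raise_nmi()                 # second NMI while masked: queued
after_second = cpu.delivered
cpu.iret()                      # the window reopens here
after_iret = cpu.delivered
print(after_second, after_iret)
```

In this model the handler is entered only once until iret() is called, so the question in the thread becomes which IRET (if any) ran between the first NMI and the hand-over to the kdump kernel.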
> Having the crash utility loop forever on such dumps is annoying, at the
> very least. And I imagine, such hangs could cause quite some headache to
> Louis Bouchard. ;)
>
> Just my $0.02,
> Petr Tesarik
> PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> #2 [ffffffff8046dde0] panic at ffffffff801327fa
> #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> #7 [ffffffff8046df50] nmi at ffffffff8032268f
> [exception RIP: smp_send_stop+84]
> RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
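
One thing the backtrace above lets a reader check by hand is where the interrupted RSP sits relative to the frames listed for the NMI handler. The short Python sketch below only does that arithmetic on addresses copied from the backtrace; the variable names are made up for illustration, and the interpretation is a guess, not a statement about how crash itself walks the stack.

```python
# Bracketed addresses of frames #0-#7, copied from the backtrace above.
nmi_frames = {
    "machine_kexec":       0xffffffff8046dc50,
    "crash_kexec":         0xffffffff8046dd20,
    "panic":               0xffffffff8046dde0,
    "dumpsw_notify":       0xffffffff8046ded0,
    "notifier_call_chain": 0xffffffff8046dee0,
    "default_do_nmi":      0xffffffff8046df00,
    "do_nmi":              0xffffffff8046df40,
    "nmi":                 0xffffffff8046df50,
}

# RSP saved in the exception frame (also the address shown for frame #8).
interrupted_rsp = 0xffffffff8046ddd8

lowest = min(nmi_frames.values())
highest = max(nmi_frames.values())

# Does the interrupted stack pointer fall inside the range the handler uses?
rsp_inside_nmi_frames = lowest <= interrupted_rsp <= highest

# Which listed frame sits immediately above the interrupted RSP?
nearest_above = min(
    (addr, name) for name, addr in nmi_frames.items() if addr > interrupted_rsp
)

print(rsp_inside_nmi_frames, nearest_above[1], hex(nearest_above[0] - interrupted_rsp))
```

The saved RSP falls inside the address range of frames #0 to #7 and sits 8 bytes below the address shown for panic. One possible reading, consistent with the double NMI in the subject line, is that the interrupted smp_send_stop was already running on the stack the NMI handler uses.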
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility