----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:
Sorry, guess I wasn't clear. Nobody hit the dump switch on these
systems. They simply had multiple hardware errors that apparently
triggered the NMI more than once. That's what I was trying to show with
the SEL records, that the multiple NMIs were straight from hardware with
no human intervention.
The systems went through a panic (due to multiple NMIs),
That's what I'm trying to figure out -- when and how was it decided that
the machine should panic instead of continuing to handle the stream of NMIs?
In other words, this "dumpsw_notify" function -- why was it called?
> PID: 0 TASK: ffffffff8038c340 CPU: 0 COMMAND: "swapper"
> #0 [ffffffff8046dc50] machine_kexec at ffffffff8011a95b
> #1 [ffffffff8046dd20] crash_kexec at ffffffff80154351
> #2 [ffffffff8046dde0] panic at ffffffff801327fa
> #3 [ffffffff8046ded0] dumpsw_notify at ffffffff8831c0c3
> #4 [ffffffff8046dee0] notifier_call_chain at ffffffff8032481f
> #5 [ffffffff8046df00] default_do_nmi at ffffffff80322fab
> #6 [ffffffff8046df40] do_nmi at ffffffff80323365
> #7 [ffffffff8046df50] nmi at ffffffff8032268f
> [exception RIP: smp_send_stop+84]
> RIP: ffffffff80116e44 RSP: ffffffff8046ddd8 RFLAGS: 00000246
> RAX: 00000000000000ff RBX: ffffffff8831c1f8 RCX: 000041049c7256e8
> RDX: 0000000000000005 RSI: 000000005238a938 RDI: 00000000002896a0
> RBP: ffffffff8046df08 R8: 00000000000040fb R9: 000000005238a7e8
> R10: 0000000000000002 R11: 0000ffff0000ffff R12: 000000000000000c
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #8 [ffffffff8046ddd8] smp_send_stop at ffffffff80116e44
From what you're implying, there is no physical "dump switch".
So I'm trying to figure out where that "dumpsw_notify()" function
comes from. Whose module is that and what is its purpose?
Dave
a reboot, and
then crash was run on the resulting dump. In fact crash was
automatically run via a startup script and there was no human
intervention until after it was noticed that crash was filling up the
root file system with a temporary file due to the infinite loop.