----- "Lucas Silacci" <Lucas.Silacci(a)teradata.com> wrote:
Hi,
I've run into an issue where crash will enter an infinite loop while
decoding exception stacks if those stacks get corrupted.
We've seen this on four different systems where the hardware generated
multiple NMIs and the second and subsequent NMIs caused the NMI
exception stack to be overwritten. When this condition is hit, the
bottom rsp on the NMI exception stack (which would normally point you
back to the kernel thread stack or possibly a different exception stack)
points you back into the middle of the same NMI exception stack. This
causes crash to infinitely loop when it tries to decode that exception
stack.
Now, clearly the root cause of the issue is faulty hardware that
generated multiple NMIs. However, a very small change in crash can detect
this condition and stop the infinite loop, which lets you get far enough
in crash to actually tell that it was an NMI that caused the system to
dump.
The patch is attached to this email. On x86_64, it detects any
exception stack whose bottom rsp points back into that same stack.
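
To illustrate the idea only (this is not the attached patch, and it does
not use crash's real data structures), the check amounts to asking whether
the rsp saved at the bottom of an exception stack falls inside that same
stack's address range. The struct and function names below are
hypothetical and exist only for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one exception stack (not a crash type). */
struct estack_range {
    uint64_t base;   /* lowest address of the exception stack */
    uint64_t size;   /* size of the exception stack in bytes  */
};

/*
 * Return true if the rsp saved at the bottom of the exception stack
 * points back into that same stack, i.e. into [base, base + size).
 * Following such a link would revisit the same stack forever, so a
 * caller walking exception stacks should stop when this returns true.
 */
static bool estack_links_to_itself(const struct estack_range *es,
                                   uint64_t saved_rsp)
{
    /* Subtract only after the lower-bound check to avoid wrap-around. */
    return saved_rsp >= es->base && (saved_rsp - es->base) < es->size;
}
```

A backtrace loop could call such a check before switching to the stack
named by the saved rsp, and report the stack as corrupt instead of
following the link.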
Please feel free to ask me any questions on this.
Wow, that's pretty interesting -- I've certainly never seen that before.
Can you show me what the backtrace looks like with your patch applied?
Thanks,
Dave