On Fri, 25 Apr 2014 10:11:28 -0400 (EDT)
Dave Anderson <anderson(a)redhat.com> wrote:
> ----- Original Message -----
> > Hi all,
> >
> > as discovered by my colleagues, the backtrace code has been broken for
> > NMI stacks since kernel commit 3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0
> > (Linux 3.3).
> >
> > I am working on a fix, but it's tricky to get all cases right. For
> > example, the copied and saved register locations were swapped with
> > kernel commit 28696f434fef0efa97534b59986ad33b9c4df7f8, so we have at
> > least 3 possible layouts:
> >
> > 1. pre-3.3 (no nesting)
> > 2. 3.3 to 3.8 (saved, then copied)
> > 3. 3.8+ (copied, then saved)
> >
> > I'm writing this mail to tell you I'm working on it. I don't have a fix
> > (yet), but want to avoid duplicate efforts if more people start working
> > on this.
> >
> > Petr T
>
> Thanks Petr, I appreciate your efforts, and won't get in your way...
> I was aware of Steven's work in this area, but haven't yet seen any
> core dumps that show the changes. What exactly happens? Does the
> backtrace fumble its way through the top of the NMI stack, but then
> successfully make the transition to the original stack, or does it
> just blow up while transitioning through the NMI stack?

It will show an incorrect register dump, but the backtrace continues.
For example:

PID: 0      TASK: ffff880232b2c440  CPU: 7   COMMAND: "kworker/0:1"
 #0 [ffff88023fdc7e40] crash_nmi_callback at ffffffff8102428f
 #1 [ffff88023fdc7e50] notifier_call_chain at ffffffff81461ec7
 #2 [ffff88023fdc7e80] __atomic_notifier_call_chain at ffffffff81461f0d
 #3 [ffff88023fdc7e90] notify_die at ffffffff81461f5d
 #4 [ffff88023fdc7ec0] default_do_nmi at ffffffff8145f3a7
 #5 [ffff88023fdc7ee0] do_nmi at ffffffff8145f5d8
 #6 [ffff88023fdc7ef0] restart_nmi at ffffffff8145eb2d
    [exception RIP: mwait_idle+423]
    RIP: ffffffff8100b217  RSP: ffff880232b2ff18  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
    RDX: ffff880232b2ff18  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff8100b217   R8: ffffffff8100b217   R9: 0000000000000018
    R10: ffff880232b2ff18  R11: 0000000000000246  R12: ffffffffffffffff
    R13: ffffffff81d36108  R14: ffff880232b2ffd8  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffff880232b2ff18] mwait_idle at ffffffff8100b217
 #8 [ffff880232b2ff30] cpu_idle at ffffffff81002126

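One way to spot this kind of bogus dump: in the example, the values of
RIP, CS, RFLAGS, RSP and SS show up twice more among the general-purpose
registers (RBP through R9, and R8 through RSI). That looks like the saved
and copied frames on the NMI stack being read as if they were a pt_regs.
Below is a minimal Python sketch of such a sanity check. It assumes the
usual x86_64 pt_regs slot order; the function name is hypothetical, and
this is not what crash currently does.

```python
# x86_64 pt_regs slots, from lowest to highest address.
PT_REGS_ORDER = [
    "R15", "R14", "R13", "R12", "RBP", "RBX", "R11", "R10", "R9", "R8",
    "RAX", "RCX", "RDX", "RSI", "RDI", "ORIG_RAX",
    "RIP", "CS", "RFLAGS", "RSP", "SS",
]


def count_iret_frame_copies(regs):
    """Count how often the (RIP, CS, RFLAGS, RSP, SS) frame reappears
    in the 16 slots that precede it (hypothetical helper)."""
    values = [regs[name] for name in PT_REGS_ORDER]
    frame = values[-5:]   # hardware exception frame
    others = values[:-5]  # slots that should hold R15..ORIG_RAX
    return sum(
        1 for i in range(len(others) - 4) if others[i:i + 5] == frame
    )


# Register values copied from the dump in this mail.
DUMP = {
    "RIP": 0xffffffff8100b217, "RSP": 0xffff880232b2ff18, "RFLAGS": 0x246,
    "RAX": 0x10, "RBX": 0x10, "RCX": 0x246,
    "RDX": 0xffff880232b2ff18, "RSI": 0x18, "RDI": 0x1,
    "RBP": 0xffffffff8100b217, "R8": 0xffffffff8100b217, "R9": 0x18,
    "R10": 0xffff880232b2ff18, "R11": 0x246, "R12": 0xffffffffffffffff,
    "R13": 0xffffffff81d36108, "R14": 0xffff880232b2ffd8, "R15": 0x0,
    "ORIG_RAX": 0x0, "CS": 0x10, "SS": 0x18,
}

# A non-zero count suggests the "registers" are really repeated frames.
print(count_iret_frame_copies(DUMP))
```

A real register set could in principle match by coincidence, so this is
only a hint that the dump is suspect, not proof.
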
If there is a nested NMI, reading the code suggests that crash may loop
back onto the NMI stack, but I don't have a sample dump file at the moment.
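
For reference, the three layouts from my first mail could be told apart
by kernel version roughly as in the Python sketch below. The function
name and the returned labels are hypothetical. The boundaries are taken
straight from that list, and treating 3.8 itself as the newer layout is
an assumption. A pure version check will also be wrong for distribution
kernels that backport either commit, so a real fix probably needs to
look at the dump itself.

```python
def nmi_stack_layout(version):
    """Map a (major, minor) kernel version to one of the three NMI
    stack layouts (hypothetical helper, mainline versions only)."""
    if version < (3, 3):
        return "no nesting"          # before commit 3f3c8b8c4b2a
    if version < (3, 8):
        return "saved, then copied"  # before commit 28696f434fef
    return "copied, then saved"


for v in [(3, 2), (3, 3), (3, 8)]:
    print(v, nmi_stack_layout(v))
```
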
Petr T