Re: [Crash-utility] Broken backtrace with nested NMIs

Tuesday, 29 April 2014

On Fri, 25 Apr 2014 10:11:28 -0400 (EDT)
Dave Anderson <anderson(a)redhat.com> wrote:

...
 ----- Original Message -----
 > Hi all,
 > 
 > as discovered by my colleagues, the backtrace code has been broken for
 > NMI stacks since kernel commit 3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0
 > (Linux 3.3).
 > 
 > I am working on a fix, but it's tricky to get all cases right. For
 > example, the copied and saved register locations were swapped with
 > kernel commit 28696f434fef0efa97534b59986ad33b9c4df7f8, so we have at
 > least 3 possible layouts:
 > 
 > 1. pre-3.3 (no nesting)
 > 2. 3.3 to 3.8 (saved, then copied)
 > 3. 3.8+ (copied, then saved)
 > 
 > I'm writing this mail to tell you I'm working on it. I don't have a fix
 > (yet), but want to avoid duplicate efforts if more people start working
 > on this.
 > 
 > Petr T

 Thanks Petr, I appreciate your efforts, and won't get in your way...

 I was aware of Steven's work in this area, but haven't yet seen any
 core dumps that show the changes.  What exactly happens?  Does the
 backtrace fumble its way through the top of the NMI stack, but then
 successfully make the transition to the original stack, or does it 
 just blow up while transitioning through the NMI stack?

It shows an incorrect register dump for the exception frame, but the
backtrace still continues onto the original stack.
For example:

PID: 0      TASK: ffff880232b2c440  CPU: 7   COMMAND: "kworker/0:1"
 #0 [ffff88023fdc7e40] crash_nmi_callback at ffffffff8102428f
 #1 [ffff88023fdc7e50] notifier_call_chain at ffffffff81461ec7
 #2 [ffff88023fdc7e80] __atomic_notifier_call_chain at ffffffff81461f0d
 #3 [ffff88023fdc7e90] notify_die at ffffffff81461f5d
 #4 [ffff88023fdc7ec0] default_do_nmi at ffffffff8145f3a7
 #5 [ffff88023fdc7ee0] do_nmi at ffffffff8145f5d8
 #6 [ffff88023fdc7ef0] restart_nmi at ffffffff8145eb2d
    [exception RIP: mwait_idle+423]
    RIP: ffffffff8100b217  RSP: ffff880232b2ff18  RFLAGS: 00000246
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
    RDX: ffff880232b2ff18  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff8100b217   R8: ffffffff8100b217   R9: 0000000000000018
    R10: ffff880232b2ff18  R11: 0000000000000246  R12: ffffffffffffffff
    R13: ffffffff81d36108  R14: ffff880232b2ffd8  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffff880232b2ff18] mwait_idle at ffffffff8100b217
 #8 [ffff880232b2ff30] cpu_idle at ffffffff81002126
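
One way to see that the register dump above is wrong: read in struct pt_regs
memory order, several "general purpose" registers repeat the RIP/CS/RFLAGS/RSP/SS
values of the exception frame itself. The sketch below is only an illustration,
not part of crash. The register values are copied from the dump above; the
pt_regs member order is the usual x86_64 one and is assumed to match this
kernel; the function name find_iret_frame_copies is made up for this example.

```python
# Register values copied from the "exception RIP" dump above.
dump = {
    "RIP": 0xffffffff8100b217, "RSP": 0xffff880232b2ff18, "RFLAGS": 0x246,
    "RAX": 0x10, "RBX": 0x10, "RCX": 0x246,
    "RDX": 0xffff880232b2ff18, "RSI": 0x18, "RDI": 0x1,
    "RBP": 0xffffffff8100b217, "R8": 0xffffffff8100b217, "R9": 0x18,
    "R10": 0xffff880232b2ff18, "R11": 0x246, "R12": 0xffffffffffffffff,
    "R13": 0xffffffff81d36108, "R14": 0xffff880232b2ffd8, "R15": 0x0,
    "ORIG_RAX": 0x0, "CS": 0x10, "SS": 0x18,
}

# Assumption: x86_64 struct pt_regs member order, lowest address first.
PT_REGS_ORDER = [
    "R15", "R14", "R13", "R12", "RBP", "RBX", "R11", "R10", "R9", "R8",
    "RAX", "RCX", "RDX", "RSI", "RDI", "ORIG_RAX",
    "RIP", "CS", "RFLAGS", "RSP", "SS",
]

IRET_WORDS = 5  # RIP, CS, RFLAGS, RSP, SS


def find_iret_frame_copies(regs):
    """Return the register names of every 5-word window in the lower part
    of pt_regs whose values equal the trailing RIP/CS/RFLAGS/RSP/SS frame."""
    words = [regs[name] for name in PT_REGS_ORDER]
    frame = words[-IRET_WORDS:]
    hits = []
    # Stop before the window that is the real frame itself.
    for start in range(len(words) - IRET_WORDS):
        if words[start:start + IRET_WORDS] == frame:
            hits.append(PT_REGS_ORDER[start:start + IRET_WORDS])
    return hits


for names in find_iret_frame_copies(dump):
    print("looks like an exception frame copy:", ", ".join(names))
```

For this dump the check reports two windows, RBP/RBX/R11/R10/R9 and
R8/RAX/RCX/RDX/RSI. That appears consistent with crash reading the extra
frames on the NMI stack as if they were saved general purpose registers.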

If there is a nested NMI, reading the code suggests that crash may loop
back onto the NMI stack, but I don't have a sample dump file at the moment.
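
The three layouts listed in the quoted message could be told apart roughly as
sketched below. This is a minimal illustration, not the fix being worked on.
The names NmiLayout and nmi_layout are hypothetical, and the list in the
quoted message names 3.8 in both ranges, so treating 3.8 itself as the newer
layout is an assumption. A version check is also only a heuristic, since a
distribution kernel may carry backports that change the layout.

```python
from enum import Enum


class NmiLayout(Enum):
    NO_NESTING = 1          # pre-3.3
    SAVED_THEN_COPIED = 2   # 3.3 up to (assumed: but not including) 3.8
    COPIED_THEN_SAVED = 3   # 3.8 and later (assumed boundary)


def nmi_layout(major, minor):
    """Guess the NMI stack layout from the upstream kernel version."""
    version = (major, minor)
    if version < (3, 3):
        return NmiLayout.NO_NESTING
    if version < (3, 8):
        return NmiLayout.SAVED_THEN_COPIED
    return NmiLayout.COPIED_THEN_SAVED


for ver in [(3, 2), (3, 3), (3, 7), (3, 8), (3, 14)]:
    print("%d.%d" % ver, nmi_layout(*ver).name)
```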

Petr T
