----- Original Message -----
Hi Dave,
Thanks for the help.
I have some doubts regarding kdump and crash utility.
I am analyzing a vmcore dump, caused by an oops at a customer location,
using the crash utility. The oops report is below:
[345132.723424] BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
[345132.724928] IP: [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
[345132.726100] PGD 2c8e03067 PUD 2cbd88067 PMD 0
[345132.727187] Oops: 0000 [#1] SMP
[345132.727879] last sysfs file: /sys/block/loop7/dev
[345132.728935] CPU 1
[345132.729396] Modules linked in: xt_tcpudp iptable_filter ip_tables x_tables strmfs_mod bond0 ipmi_devintf hpwdt sctp ipv6 crc32c libcrc32c loop ipmi_si tpm_tis ipmi_msghandler hpilo tpm tpm_bios psmouse serio_raw shpchp pci_hotplug container processor evdev ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom ide_pci_generic ide_core usbhid hid ata_piix ata_generic libata ehci_hcd bnx2 uhci_hcd e1000e cciss scsi_mod button thermal fan thermal_sys edd [last unloaded: scsi_wait_scan]
[345132.739511] Pid: 13366, comm: telnet Not tainted 2.6.32-cdma-18-amd64 #1 ProLiant DL380 G6
[345132.741423] RIP: 0010:[<ffffffff811f03b3>]  [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
[345132.743220] RSP: 0018:ffff88031ce75da8 EFLAGS: 00010246
[345132.744469] RAX: 0000000000000000 RBX: ffff8802cbd54a68 RCX: 000000000061c044
[345132.746061] RDX: 0000000000000005 RSI: ffff88031ce75e87 RDI: ffff8802cbd54d1c
[345132.747726] RBP: ffff88031ce75eb8 R08: 0000000000000000 R09: 0000000000000000
[345132.749391] R10: 0000000000616680 R11: 0000000000000246 R12: 000000000061c044
[345132.750981] R13: ffff8802cbd54800 R14: 0000000000000000 R15: 7fffffffffffffff
[345132.752650] FS:  00007ffff7fee6f0(0000) GS:ffff880033020000(0000) knlGS:0000000000000000
[345132.754569] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[345132.755915] CR2: 0000000000000005 CR3: 000000030c408000 CR4: 00000000000006e0
[345132.757579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[345132.759169] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[345132.760778] Process telnet (pid: 13366, threadinfo ffff88031ce74000, task ffff88031b60d580)
[345132.762707] Stack:
[345132.763162] ffff88031b60d580 ffff88031b60d580 ffff88031b60d580 ffff88031b60d580
[345132.764791] <0> 000000000061c02b 0000000000000000 0000000000000000 000000000061c02a
[345132.766510] <0> ffff8802de651a40 ffff8802cbd549c0 ffff8802cbd54c90 ffff8802cbd54d1c
[345132.768270] Call Trace:
[345132.768877] [<ffffffff81045f84>] ? default_wake_function+0x0/0xf
[345132.770309] [<ffffffff811ebf7e>] tty_read+0x7d/0xba
[345132.771526] [<ffffffff810ebcc8>] vfs_read+0xab/0x167
[345132.772541] [<ffffffff810ebe48>] sys_read+0x47/0x6f
[345132.773526] [<ffffffff8100bbc2>] system_call_fastpath+0x16/0x1b
[345132.774652] Code: 00 41 8b 85 5c 02 00 00 48 8b 9d 78 ff ff ff f0 0f b3 03 45 19 f6 49 63 95 5c 02 00 00 49 8b 85 50 02 00 00 48 8b bd 48 ff ff ff <0f> be 1c 10 e8 fc 6b 0e 00 48 89 c6 41 8b 85 5c 02 00 00 41 ff
[345132.778840] RIP [<ffffffff811f03b3>] n_tty_read+0x58c/0x818
[345132.780107] RSP <ffff88031ce75da8>
[345132.780969] CR2: 0000000000000005
[345132.781786] hpwdt: New timer passed in is 120 seconds.
[345132.782942] hpwdt: timer reset to 120 for kdump
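If I am decoding the Code: bytes correctly (this is my own analysis, so
please correct me if I am misreading it), the faulting instruction at the
RIP is:

    0f be 1c 10            movsbl (%rax,%rdx,1),%ebx

which corresponds to c = tty->read_buf[tty->read_tail], with RAX holding
read_buf and RDX holding read_tail. With RAX = 0000000000000000 and
RDX = 0000000000000005, the load address is 0x5, which matches CR2. So at
the instant of the fault, read_buf appears to have been NULL and
read_tail 5.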
After analysis, we figured out that the crash occurs in the function
n_tty_read() in kernel-source/drivers/char/n_tty.c. The oops occurred on
Linux kernel 2.6.32. Below is the code fragment where the page fault
occurred; the fault happens while executing the statement
c = tty->read_buf[tty->read_tail].
/* N.B. avoid overrun if nr == 0 */
while (nr && tty->read_cnt) {
        int eol;

        eol = test_and_clear_bit(tty->read_tail,
                                 tty->read_flags);
        c = tty->read_buf[tty->read_tail];  /* <-- page fault statement,
                                             * per our oops analysis */
        spin_lock_irqsave(&tty->read_lock, flags);
        tty->read_tail = ((tty->read_tail+1) &
                          (N_TTY_BUF_SIZE-1));
        tty->read_cnt--;
        if (eol) {
                /* this test should be redundant:
                 * we shouldn't be reading data if
                 * canon_data is 0
                 */
                if (--tty->canon_data < 0)
                        tty->canon_data = 0;
        }
        spin_unlock_irqrestore(&tty->read_lock, flags);
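My suspicion is that the window comes from sampling read_buf and
read_tail before read_lock is taken. Purely to illustrate what I mean
(a sketch of my own, not the actual upstream code or a proposed fix), a
version of the loop that takes the lock before touching the buffer would
look roughly like this:

    while (nr && tty->read_cnt) {
            int eol;

            spin_lock_irqsave(&tty->read_lock, flags);
            if (!tty->read_cnt) {
                    /* re-check under the lock: a concurrent flush may
                     * have emptied the buffer since the while() test */
                    spin_unlock_irqrestore(&tty->read_lock, flags);
                    break;
            }
            eol = test_and_clear_bit(tty->read_tail, tty->read_flags);
            c = tty->read_buf[tty->read_tail];
            tty->read_tail = ((tty->read_tail + 1) & (N_TTY_BUF_SIZE - 1));
            tty->read_cnt--;
            if (eol && --tty->canon_data < 0)
                    tty->canon_data = 0;
            spin_unlock_irqrestore(&tty->read_lock, flags);
            /* ... copy c to the user buffer as before ... */
    }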
Below are the contents of the structure tty_struct at the time of the
oops. This structure was passed as an argument to n_tty_read():
tty_struct ffff8802cbd54800
struct tty_struct { ...
magic = 21505,
driver = 0xffff88031b54ea00,
ops = 0xffffffff8130f650,
name = "pts9\000\...",
driver_data = 0xffff88029c8a9668,
icanon = 1 '\001',
read_buf = 0xffff8802cbfe6000 "",
read_head = 0,
read_tail = 0,
read_cnt = 0,
read_flags = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
canon_data = 0,
......................................
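For reference, this is how I extracted the structure above; the
individual fields can also be re-checked directly (the comma-separated
member list works on the crash versions I have used, but that may vary):

    crash> struct tty_struct ffff8802cbd54800
    crash> struct tty_struct.read_buf,read_tail,read_cnt,canon_data ffff8802cbd54800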
As per the crash utility, the field read_cnt was 0 when the kernel
oopsed. In that case, the condition while (nr && tty->read_cnt) in the
code fragment above should have evaluated false, and the faulting
statement should never have been reached. This leads me to think that
some other thread/task in the kernel updated the read_cnt field in
parallel. However, the crash utility reports the runqueues of all CPUs at
the time of the crash as idle, except for CPU 1, which was executing the
user program telnet in kernel context (in a system call). Below is the
runqueue output:
CPU 0 RUNQUEUE: ffff880033012d80
CURRENT: PID: 0 TASK: ffffffff814204b0 COMMAND: "swapper"
RT PRIO_ARRAY: ffff880033012e98
[no tasks queued]
CFS RB_ROOT: ffff880033012e10
[no tasks queued]
CPU 1 RUNQUEUE: ffff880033032d80
CURRENT: PID: 13366 TASK: ffff88031b60d580 COMMAND: "telnet"
RT PRIO_ARRAY: ffff880033032e98
[no tasks queued]
CFS RB_ROOT: ffff880033032e10
[no tasks queued]
CPU 2 RUNQUEUE: ffff880033052d80
CURRENT: PID: 0 TASK: ffff88031e0e3540 COMMAND: "swapper"
RT PRIO_ARRAY: ffff880033052e98
[no tasks queued]
CFS RB_ROOT: ffff880033052e10
[no tasks queued]
CPU 3 RUNQUEUE: ffff880033072d80
CURRENT: PID: 0 TASK: ffff88031e113580 COMMAND: "swapper"
RT PRIO_ARRAY: ffff880033072e98
[no tasks queued]
CFS RB_ROOT: ffff880033072e10
[no tasks queued]
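If it helps, I can also check which tasks ran most recently before the
crash. As far as I understand the crash utility (please correct me if
these options behave differently in other versions), tasks can be listed
by their last-run timestamps:

    crash> ps -l        tasks sorted by last_run/timestamp, most recently run first
    crash> ps -m        the same ordering, with timestamps shown as time deltas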
How is this logically possible? Crash reports that no other tasks were
running at the time. Or did some process/thread run before the oops
triggered and kdump captured the memory image, and update the data
structure in that window? I would like to know whether this scenario is
possible. I kindly request your suggestions/guidance, and please let me
know if you need any other details.
It's not clear to me, but I'm not at all familiar with this code area.
Maybe during hard or soft IRQ handling on this or another cpu? Presumably
there would be protection against that happening, and maybe it's of interest
that the very next instruction after the fault is a spin_lock_irqsave()
call, but that's just a wild guess on my part...
Anyway, the crash utility shows what the state of memory was at the point
when the "telnet" process (indirectly) issued NMI interrupts to all of
the other cpus. You can verify where the other cpus were (in idle) by
"bt -a", which shows/verifies the reception of the NMI shutdown interrupt.
Dave