[Crash-utility] crash sometimes doesn't terminate, loops forever looking for a process that doesn't exist

Saturday, 5 November 2011

Dear crash-utility,

In our vmcore analysis infrastructure we stumbled on a case where crash
doesn't terminate. Examined with gdb, the process appears to be looping
forever over /proc/$pid/stat in an attempt to determine the PID of a
process that no longer exists.

The backtrace:
(gdb) bt
#0  0x00007fb9814fea57 in munmap () from /lib/libc.so.6
#1  0x00007fb9814a30aa in _IO_setb () from /lib/libc.so.6
#2  0x00007fb9814a1d18 in _IO_file_close_it () from /lib/libc.so.6
#3  0x00007fb981495a48 in fclose () from /lib/libc.so.6
#4  0x00000000004fe75b in output_command_to_pids () at cmdline.c:775
#5  0x00000000004fed7c in setup_redirect (origin=1) at cmdline.c:519
#6  0x00000000005012bb in process_command_line () at cmdline.c:149
#7  0x000000000045f575 in main_loop () at main.c:610
#8  0x0000000000541af9 in captured_command_loop (data=0x7fb982282000)
at ./main.c:226
#9  0x000000000053fd8b in catch_errors (func=<value optimized out>,
func_args=<value optimized out>, errstring=<value optimized out>,
mask=<value optimized out>) at exceptions.c:520
#10 0x00000000005415b6 in captured_main (data=<value optimized out>)
at ./main.c:924
#11 0x000000000053fd8b in catch_errors (func=<value optimized out>,
func_args=<value optimized out>, errstring=<value optimized out>,
mask=<value optimized out>) at exceptions.c:520
#12 0x0000000000540994 in gdb_main (args=0x1000) at ./main.c:939
#13 0x00000000005409ce in gdb_main_entry (argc=<value optimized out>,
argv=0x1000) at ./main.c:959
#14 0x000000000046025a in main (argc=<value optimized out>,
argv=<value optimized out>) at main.c:525

The problematic code:
 720 /*
 721  *  Determine the pids of the current popen'd shell and output command.
 722  *  This is all done using /proc; the ps kludge at the bottom of this
 723  *  routine is legacy, and should only get executed if /proc doesn't exist.
 724  */
 725 static int
 726 output_command_to_pids(void)
 727 {
 ...
 738         int retries;
 739
 740         retries = 0;
 741         pc->pipe_pid = pc->pipe_shell_pid = 0;
 742         sprintf(lookfor, "(%s)", pc->pipe_command);
 743         stall(1000);
 744 retry:
 745         if (is_directory("/proc") && (dirp = opendir("/proc"))) {
 746                 for (dp = readdir(dirp); dp && !pc->pipe_pid;
 747                      dp = readdir(dirp)) {
 748                         if (!decimal(dp->d_name, 0))
 749                                 continue;
 750                         sprintf(buf1, "/proc/%s/stat", dp->d_name);
 751                         if (file_exists(buf1, NULL) &&
 752                             (stp = fopen(buf1, "r"))) {
 753                                 if (fgets(buf2, BUFSIZE, stp)) {
 754                                         pid = strtok(buf2, " ");
 755                                         name = strtok(NULL, " ");
 756                                         status = strtok(NULL, " ");
 757                                         p_pid = strtok(NULL, " ");
 758                                         pgrp = strtok(NULL, " ");
 759                                         if (STREQ(name, "(sh)") &&
 760                                             (atoi(p_pid) == getpid()))
 761                                                 pc->pipe_shell_pid = atoi(pid);
 762                                         if (STREQ(name, lookfor) &&
 763                                             ((atoi(p_pid) == getpid()) ||
 764                                              (atoi(p_pid) == pc->pipe_shell_pid)
 765                                              || (atoi(pgrp) == getpid()))) {
 766                                                 pc->pipe_pid = atoi(pid);
 767                                                 console(
 768                             "FOUND[%d] (%d->%d->%d) %s %s p_pid: %s pgrp: %s\n",
 769                                                     retries, getpid(),
 770                                                     pc->pipe_shell_pid,
 771                                                     pc->pipe_pid,
 772                                                     name, status, p_pid, pgrp);
 773                                         }
 774                                 }
 775                                 fclose(stp);
 776                         }
 777                 }
 778                 closedir(dirp);
 779         }
 780
 781         if (!pc->pipe_pid && ((retries++ < 10) || pc->pipe_shell_pid)) {
 782                 stall(1000);
 783                 goto retry;
 784         }

Looking at how many times it has been looping over /proc:
(gdb) p retries
$19 = 138056108

It found the PID of the shell but not of the command:
(gdb) p pc->pipe_shell_pid
$20 = 9306
(gdb) p pc->pipe_pid
$21 = 0

For completeness, the command being run looked like this:
(gdb) p pc->orig_line
$26 = "log | grep -A1 'some string' >> /some/file", '\000'
[...]

So it seems something like this happened:
+>popen(grep)
+--> fork(); execve(sh)
+---> fork(); execve(grep)
+----> grep exit()s for some reason
+>crash(8) finds sh in /proc
+---> sh exit
+>crash(8) keeps looking for grep in /proc

I have a second core showing a similar situation, if that's of any use.
For now we work around the problem by wrapping crash(8) in timeout(1).

We could fix that function to bail out when the shell exits, but that
doesn't strike me as a clean solution. So I looked at why we want the
PID of that command, and there seem to be two reasons:

* determining whether the process is still alive
This can be done by checking whether the intervening shell is still alive.
Obtaining only the PID of the shell seems less problematic than trying to
get the PID of the grandchild. At worst, reimplementing popen() to store
the PID of sh is not exactly hard.

* terminating the process forcibly (SIGKILL)
This is done in close_output() which is only called from within restart() when
its argument is not SIGSEGV, SIGPIPE, SIGINT or 0. I cannot find that function
being set as a signal handler for anything else or being called with an
argument different from 0. As far as I can tell this is dead code.

Before I write/test/submit a patch, am I missing something or would it make
sense to get rid of that pipe_pid?

Thanks,
Adrien
