Dear crash-utility,
In our vmcore analysis infrastructure we stumbled on a case where crash
doesn't terminate. When examining the state of the process with gdb
it seems to be looping forever over /proc/$pid/stat in an attempt to determine
the PID of a process that doesn't exist any more.
The backtrace:
(gdb) bt
#0 0x00007fb9814fea57 in munmap () from /lib/libc.so.6
#1 0x00007fb9814a30aa in _IO_setb () from /lib/libc.so.6
#2 0x00007fb9814a1d18 in _IO_file_close_it () from /lib/libc.so.6
#3 0x00007fb981495a48 in fclose () from /lib/libc.so.6
#4 0x00000000004fe75b in output_command_to_pids () at cmdline.c:775
#5 0x00000000004fed7c in setup_redirect (origin=1) at cmdline.c:519
#6 0x00000000005012bb in process_command_line () at cmdline.c:149
#7 0x000000000045f575 in main_loop () at main.c:610
#8 0x0000000000541af9 in captured_command_loop (data=0x7fb982282000)
at ./main.c:226
#9 0x000000000053fd8b in catch_errors (func=<value optimized out>,
func_args=<value optimized out>, errstring=<value optimized out>,
mask=<value optimized out>) at exceptions.c:520
#10 0x00000000005415b6 in captured_main (data=<value optimized out>)
at ./main.c:924
#11 0x000000000053fd8b in catch_errors (func=<value optimized out>,
func_args=<value optimized out>, errstring=<value optimized out>,
mask=<value optimized out>) at exceptions.c:520
#12 0x0000000000540994 in gdb_main (args=0x1000) at ./main.c:939
#13 0x00000000005409ce in gdb_main_entry (argc=<value optimized out>,
argv=0x1000) at ./main.c:959
#14 0x000000000046025a in main (argc=<value optimized out>,
argv=<value optimized out>) at main.c:525
The problematic code:
720 /*
721 * Determine the pids of the current popen'd shell and output command.
722 * This is all done using /proc; the ps kludge at the bottom of this
723 * routine is legacy, and should only get executed if /proc doesn't exist.
724 */
725 static int
726 output_command_to_pids(void)
727 {
...
738 int retries;
739
740 retries = 0;
741 pc->pipe_pid = pc->pipe_shell_pid = 0;
742 sprintf(lookfor, "(%s)", pc->pipe_command);
743 stall(1000);
744 retry:
745 if (is_directory("/proc") && (dirp =
opendir("/proc"))) {
746 for (dp = readdir(dirp); dp && !pc->pipe_pid;
747 dp = readdir(dirp)) {
748 if (!decimal(dp->d_name, 0))
749 continue;
750 sprintf(buf1, "/proc/%s/stat", dp->d_name);
751 if (file_exists(buf1, NULL) &&
752 (stp = fopen(buf1, "r"))) {
753 if (fgets(buf2, BUFSIZE, stp)) {
754 pid = strtok(buf2, " ");
755 name = strtok(NULL, " ");
756 status = strtok(NULL, " ");
757 p_pid = strtok(NULL, " ");
758 pgrp = strtok(NULL, " ");
759 if (STREQ(name, "(sh)") &&
760 (atoi(p_pid) == getpid()))
761
pc->pipe_shell_pid = atoi(pid);
762 if (STREQ(name, lookfor) &&
763 ((atoi(p_pid) == getpid()) ||
764 (atoi(p_pid) ==
pc->pipe_shell_pid)
765 || (atoi(pgrp) ==
getpid()))) {
766 pc->pipe_pid = atoi(pid);
767 console(
768 "FOUND[%d] (%d->%d->%d) %s %s p_pid:
%s pgrp: %s\n",
769 retries, getpid(),
770 pc->pipe_shell_pid,
771 pc->pipe_pid,
772 name, status,
p_pid, pgrp);
773 }
774 }
775 fclose(stp);
776 }
777 }
778 closedir(dirp);
779 }
780
781 if (!pc->pipe_pid && ((retries++ < 10) ||
pc->pipe_shell_pid)) {
782 stall(1000);
783 goto retry;
784 }
Looking at how many times it has been looping over /proc:
(gdb) p retries
$19 = 138056108
It found the PID of the shell but not of the command:
(gdb) p pc->pipe_shell_pid
$20 = 9306
(gdb) p pc->pipe_pid
$21 = 0
For completeness the command that was being run was looking like this:
(gdb) p pc->orig_line
$26 = "log | grep -A1 'some string' >> /some/file", '\000'
[...]
So it seems something like this happened:
+>popen(grep)
+--> fork(); execve(sh)
+---> fork(); execve(grep)
+----> grep exit()s for some reason
+>crash(8) finds sh in /proc
+---> sh exit
+>crash(8) keeps looking for grep in /proc
I have a second core showing a similar situation if that's of any use but
now we just work around the problem by wrapping crash(8) within timeout(1).
We could try and fix that function to bail out when the shell exits
but it really doesn't look like a nice way to do it to me. So I looked
at the reasons we want the PID of that command and it seems there are
two:
* determining whether the process is still alive
This can be done by checking whether the intervening shell is still alive.
Obtaining only the PID of the shell seems less problematic than trying to
get the PID of the grandchildren. At worst reimplementing popen()
to store the PID of sh is not exactly hard.
* terminating the process forcibly (SIGKILL)
This is done in close_output() which is only called from within restart() when
its argument is not SIGSEGV, SIGPIPE, SIGINT or 0. I cannot find that function
being set as a signal handler for anything else or being called with an
argument different from 0. As far as I can tell this is dead code.
Before I write/test/submit a patch, am I missing something or would it make
sense to get rid of that pipe_pid?
Thanks,
Adrien