D. Hugh Redelmeier wrote:
| From: Dave Anderson <anderson(a)redhat.com>
| D. Hugh Redelmeier wrote:
| > ==> Worse: while it is awaiting my RETURN, it is burning 100% of the CPU!
| >
| > Here is what "ps laxgwf" says about the crash process and its child.
| >
| > F UID  PID PPID PRI NI    VSZ    RSS WCHAN STAT TTY   TIME COMMAND
| > 4   0 4426 4406  25  0 416812 332764 -     R+   pts/5 80:36 | | \_ crash --readnow /usr/lib/debug/lib/modules/2.6.21-1.3228.fc7/vmlinux /var/crash/2007-07-02-13:42/vmcore
| > 0   0 4989 4426  18  0  73976    740 -     S+   pts/5  0:00 | | \_ /usr/bin/less -E -X -Ps -- MORE -- forward\: <SPACE>, <ENTER> or j backward\: b or k quit\: q
| >
| > strace of the crash process shows an infinite sequence of:
| > wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
| > wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
| > wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
| > wait4(4989, 0x7fffcd9cae90, WNOHANG, NULL) = 0
| >
| > This is very wasteful.
| >
| > There are other ways to get into this state: other places where less is
| > being used and is waiting, and probably wherever less is used even if it
| > isn't waiting.
| >
| > I just tested: this problem exists when using a normal xterm.
|
Again, what exactly do you do to reproduce it? I just cannot get the 100%
CPU consumption while waiting on the "less" sub-shell.
| Yeah, I have seen this on occasion, but I have never been able
| to reproduce it on demand. There was a patch suggestion a while ago,
| but I deferred it until I could reliably reproduce it for testing
| before taking it in.
I've put gdb on the case. The CPU burning that I'm currently experiencing is
in cmdline.c:restore_sanity. The actual code in question is:
while (!waitpid(pc->stdpipe_pid, &waitstatus, WNOHANG))
;
That sure looks like a busy-wait.
If you execute this code, you should get a busy-wait too.
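The pattern is easy to demonstrate outside of crash. Here is a minimal
standalone sketch (not crash source; the sleeping child and its 60-second
timeout are made-up stand-ins for the "less" process blocked on keyboard
input):

    /* Minimal standalone demonstration of the busy-wait (not crash code).
     * The sleeping child stands in for "less" blocked on keyboard input. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int waitstatus;
        pid_t child = fork();

        if (child == -1) {
            perror("fork");
            return 1;
        }
        if (child == 0) {          /* child: pretend to wait for input */
            sleep(60);
            _exit(0);
        }

        /* Same shape as the loop in restore_sanity(): with WNOHANG,
         * waitpid() returns 0 immediately while the child is still
         * running, so this loop spins at 100% CPU for the full minute. */
        while (!waitpid(child, &waitstatus, WNOHANG))
            ;

        printf("child reaped, status 0x%x\n", waitstatus);
        return 0;
    }

Running it under strace shows the same endless stream of
wait4(..., WNOHANG) = 0 calls as in the report above.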
If you replaced WNOHANG with 0, I think the wait would have the same result
but not be busy. You would then want to loop in the case where waitpid
returns -1 with errno == EINTR.
Here's what I'd try (UNTESTED!):
do ; while (waitpid(pc->stdpipe_pid, &waitstatus, 0) == -1 && errno == EINTR);
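Spelled out a little more fully, the same idea as a self-contained helper
might look like the sketch below (untested in crash itself; reap_child() is
a made-up name for illustration, and in restore_sanity() the pid would be
pc->stdpipe_pid):

    /* Sketch of the suggested fix (untested in crash): block in waitpid()
     * instead of polling, and retry only when a signal interrupts the call.
     * reap_child() is a made-up name; crash would pass pc->stdpipe_pid. */
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int reap_child(pid_t pid, int *waitstatus)
    {
        pid_t ret;

        do {
            ret = waitpid(pid, waitstatus, 0);  /* sleeps until the child exits */
        } while (ret == -1 && errno == EINTR);  /* restart if interrupted by a signal */

        return ret == -1 ? -1 : 0;
    }

The observable difference is that strace would show a single blocking
wait4() call instead of the endless wait4(..., WNOHANG) = 0 stream, and the
crash process would sleep while less waits for input.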
All the uses of WNOHANG in that function look suspicious.
I understand. I also remember that the WNOHANGs were originally added
there on purpose because of hangs I was seeing. But that's not to say
it's the best way of doing things.
As I mentioned before, there was a patch posted by someone (who, as I recall,
preferred using gdb and gdb scripts with kdump vmcores), but going
back a year and a half into the archives, I can't find it.
Anyway, I'm going to have to be able to reproduce it and test any
changes thoroughly, to avoid potentially re-introducing the hangs I
used to see.
Thanks,
Dave