On 02/21/2012 10:44 AM, Dave Anderson wrote:
 
 
 ----- Original Message -----
> We have a recurring problem in our crash analysis system, where remote users
> get disconnected and crash starts endlessly looping trying to write to stdout.
> An strace of a recent instance is looping on:
>
> write(1, "  JIFFIES\n", 10)             = -1 EIO (Input/output error)
>
> but that isn't always the output string.
>
> This is a problem in our shared environment because the orphaned crash tasks
> eat up the CPUs, and we don't have the privilege to kill each other's tasks.
>
> thanks,
> --Guy
 
 Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
 fix that you guys reported:
 
     - Fix to prevent a crash session that is run over a network connection
       that is killed/removed from going into 100% cpu-time loop.  Without
       the patch, the behavior of the built-in readline() library call in
       gdb-7.0 has changed such that the function returns when the EOF is
       encountered on /dev/tty, and the crash session goes into an endless
       loop; whereas in gdb-6.1, the readline() call never returns because
       the crash session gets killed while running in the library code.
       (anderson(a)redhat.com)
 
 But if the orphaned task is repetitively writing the same thing, it
 would never get to the next readline() call, where it would kill
 itself.  Taking your example, the "JIFFIES" write() is part of a "timer"
 command, but I'm trying to understand how/why the command is not just
 completing a series of (failed) fprintf's, and then falling into
 the next readline() -- where it should kill itself?  By any chance
 was the remote caller doing a "repeat" command on the live system,
 or something like that?  (sounds doubtful since you'd have to have
 root privileges to do that...)
  
This is not a live system. This is the setup where we analyze vmcores sent in
by our customers.
I don't understand how it happens either, unless for some reason fprintf is
retrying the failed write().
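
To illustrate the kind of retry behavior I mean, here is a minimal sketch of an
output helper that retries only when write() is interrupted (EINTR) and treats
every other error, including the EIO seen in the strace, as final. It is not
taken from the crash sources; the name write_all_or_fail is hypothetical and
the code only shows the pattern.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of crash: write len bytes to fd.
 * Retries only when write() is interrupted (EINTR). Any other error,
 * such as EIO on a terminal that has gone away, is returned at once.
 * Returns 0 on success, -1 on failure with errno left as set by write().
 */
static int write_all_or_fail(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);

    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;          /* EIO, EPIPE, EBADF, ...: give up */
        }
        buf += n;               /* short write: continue with the rest */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller that gets -1 back for stdout could then exit right away instead of
carrying on with more output that has nowhere to go.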
This is not the only failure scenario. I just saw another one repeating on
this sequence:
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
Perhaps it isn't a crash program issue at all. Maybe it's at the system
library level.
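
One possible reading of the SIGFPE trace, and this is an assumption rather than
a diagnosis: the handler re-registers itself and then returns, and returning
from a handler for a hardware-generated SIGFPE resumes at the faulting
instruction, which simply faults again. The sketch below shows the usual way
out of that, jumping to a recovery point with siglongjmp() instead of
returning. All names in it (fpe_escape, run_guarded, and the two stand-in
functions) are hypothetical, and raise() is used only to simulate the signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf fpe_escape;   /* hypothetical recovery point */

static void on_sigfpe(int sig)
{
    (void)sig;
    /* Jump out instead of returning to the instruction that faulted. */
    siglongjmp(fpe_escape, 1);
}

/*
 * Run fn() with a SIGFPE handler installed.
 * Returns 1 if SIGFPE arrived while fn() ran, 0 if it did not,
 * and -1 if the handler could not be installed.
 */
static int run_guarded(void (*fn)(void))
{
    struct sigaction sa, old;
    int caught;

    assert(fn != NULL);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) != 0)
        return -1;

    /* Second argument 1: save the signal mask so the jump restores it. */
    if (sigsetjmp(fpe_escape, 1) == 0) {
        fn();
        caught = 0;
    } else {
        caught = 1;
    }

    sigaction(SIGFPE, &old, NULL);  /* put the previous handler back */
    return caught;
}

/* Stand-ins for demonstration only. */
static void raise_fpe(void) { raise(SIGFPE); }
static void no_fault(void)  { }
```

With this shape the signal is handled once per guarded call, so a fault
cannot turn into the endless handler/return cycle shown in the strace.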
--Guy