(Was Re: [Crash-utility] mount cmd crashes crash)
On Thu, 2010-08-19 at 12:45 +0000, Dave Anderson wrote:
----- "Bob Montgomery" <bob.montgomery(a)hp.com> wrote:
> > Yeah, it's not important to use the context of pid 1, but it just needs
> > some context, and I had presumed that init would always exist. I thought
> > that the panic("Attempted to kill the idle task!") in do_exit() would
> > prevent pid 1 from ever going away -- but apparently your kernel figured
> > out how to do it elsewhere... ;-)
>
> That test is for PID 0, not PID 1 (at least on the kernel I'm
> debugging.) However, there is this also:
>
>         if (unlikely(tsk == child_reaper))
>                 panic("Attempted to kill init!");
That's the one I *meant*... ;-)
>
> And child_reaper in the dump points to a task struct for init that isn't
> in the ps listing. Hmmm. Maybe that part *is* interesting in this dump...
Well, I've been picking at this some more. PID 1 is in the system, but
crash misses it when it's building its table of tasks in
refresh_hlist_task_table_v2(). In fact, on my particular dump, it loses
track of at least 3 processes.
The attached patch changes that behavior. The problem involves collisions
in the pid_hash table: when an early item on a chain has a NULL task
pointer, the code ignores all subsequent items on that collision chain.
I'm not sure what it means when the tasks[0].first pointer in the
struct pid is NULL, but that's what triggers the problem and keeps crash
from following the pid_chain pointer to the next struct pid. I am not
confident that this whole area is correct yet, just closer to correct
than it was.
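
To illustrate the collision-chain behavior described above, here is a
minimal sketch. It is not the code in refresh_hlist_task_table_v2() and not
the attached patch; the struct and function names are hypothetical
stand-ins, with first_task playing the role of tasks[0].first and
chain_next the role of the pid_chain link. The only difference between the
two walkers is whether a NULL task pointer ends the walk or is skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct task {
    int pid;
};

struct pid_entry {
    struct task *first_task;      /* stands in for tasks[0].first; may be NULL */
    struct pid_entry *chain_next; /* stands in for the pid_chain link */
};

/* Behavior described as the problem: a NULL task pointer ends the walk,
 * so every later entry on the same collision chain is lost. */
size_t count_tasks_stop_on_null(struct pid_entry *const *buckets, size_t nbuckets)
{
    size_t found = 0;

    assert(buckets != NULL);
    for (size_t i = 0; i < nbuckets; i++) {
        for (const struct pid_entry *p = buckets[i]; p != NULL; p = p->chain_next) {
            if (p->first_task == NULL)
                break;      /* abandons the rest of this chain */
            found++;
        }
    }
    return found;
}

/* Behavior described as the fix: skip the entry with the NULL task
 * pointer, but keep following the chain to the next entry. */
size_t count_tasks_skip_null(struct pid_entry *const *buckets, size_t nbuckets)
{
    size_t found = 0;

    assert(buckets != NULL);
    for (size_t i = 0; i < nbuckets; i++) {
        for (const struct pid_entry *p = buckets[i]; p != NULL; p = p->chain_next) {
            if (p->first_task == NULL)
                continue;   /* nothing to record here, move on */
            found++;
        }
    }
    return found;
}
```

With a single chain of three entries whose middle entry has a NULL task
pointer, the first walker reports one task and the second reports two,
which is the kind of loss that hid PID 1 in this dump.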
These now appear in the ps output:
crash-5.0.6-fix2> ps 1 8144 998
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
   8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
    998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]
where before:
crash-5.0.6-fix> ps 1 8144 998
ps: invalid task or pid value: 1
ps: invalid task or pid value: 8144
ps: invalid task or pid value: 998
This might have been some transition behavior of the pid hash design in
the kernel, because I've got two dumps based on 2.6.18 kernels that show
missing processes (this one had 3 out of 532, the other had 1 out of
146), but my new patched crash doesn't reveal any missing processes in
2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
from 362 to 926). Only my recent 2.6.18 dump was lucky enough to be
missing PID 1, with me being lucky enough to try crash's mount command,
or we'd still not know about it :-)
The patch is simple, but has lots of lines because I changed the indentation.
Bob Montgomery
Working at HP