----- "Bob Montgomery" <bob.montgomery(a)hp.com> wrote:
Well, I've been picking at this some more. PID 1 is in the system, but
crash misses it when it's building its table of tasks in
refresh_hlist_task_table_v2(). In fact, on my particular dump, it loses
track of at least 3 processes.
The attached patch changes that behavior. It has to do with collisions
on the pid_hash table where an early item on the chain has a NULL task
pointer which causes the code to ignore subsequent items on that
collision chain. I'm not sure what it means when the tasks[0].first
pointer in the struct pid is NULL, but that's what triggers the problem
and keeps crash from following the pid_chain pointer to the next struct
pid. I am not confident that this whole area is correct yet, just
closer to correct than it was.
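
To illustrate the failure mode described above, here is a minimal sketch in C. It is not the code from refresh_hlist_task_table_v2() or from the attached patch. The structures are simplified, hypothetical stand-ins (struct pid_entry, first_task, and chain_next are assumed names standing for struct pid, tasks[0].first, and the pid_chain link), and the PID values are made up. It only contrasts a walk that gives up on a collision chain at the first NULL task pointer with one that skips that entry and keeps following the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; the real kernel layouts differ. */
struct task {
    int pid;
};

struct pid_entry {
    struct task *first_task;      /* stands in for tasks[0].first */
    struct pid_entry *chain_next; /* stands in for the pid_chain link */
};

/* Behavior described as the problem: a NULL task pointer ends the walk,
 * so every later entry on the same collision chain is lost. */
size_t walk_chain_stop_on_null(const struct pid_entry *head,
                               const struct task **out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    for (const struct pid_entry *p = head; p != NULL; p = p->chain_next) {
        if (p->first_task == NULL)
            break;              /* abandons the rest of the chain */
        if (n < max)
            out[n++] = p->first_task;
    }
    return n;
}

/* Behavior after the change: skip the entry with no task, but still
 * follow the chain link to the next entry. */
size_t walk_chain_skip_null(const struct pid_entry *head,
                            const struct task **out, size_t max)
{
    size_t n = 0;

    assert(out != NULL);
    for (const struct pid_entry *p = head; p != NULL; p = p->chain_next) {
        if (p->first_task == NULL)
            continue;           /* nothing to record here, keep walking */
        if (n < max)
            out[n++] = p->first_task;
    }
    return n;
}

/* Builds a three-entry chain whose middle entry has a NULL task pointer
 * and returns how many tasks the chosen walk finds. */
size_t demo_count(int skip_null)
{
    static struct task t_a = { 10 }, t_c = { 20 }; /* illustrative PIDs */
    static struct pid_entry c = { &t_c, NULL };
    static struct pid_entry b = { NULL, &c };
    static struct pid_entry a = { &t_a, &b };
    const struct task *found[4];

    return skip_null ? walk_chain_skip_null(&a, found, 4)
                     : walk_chain_stop_on_null(&a, found, 4);
}
```

On this sample chain, demo_count(0) finds 1 task and demo_count(1) finds 2: the task behind the NULL entry is only reachable when the walk continues past it.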
These now appear in the ps output:
crash-5.0.6-fix2> ps 1 8144 998
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      1      0   1  ffff81012bd3c780  IN   0.0    6124    688  init
   8144   6257   0  ffff81011996e140  RU   0.7  108876  35016  mirrorclient
    998     11   0  ffff81012a9cd780  IN   0.0       0      0  [fc_dl_1]
where before:
crash-5.0.6-fix> ps 1 8144 998
ps: invalid task or pid value: 1
ps: invalid task or pid value: 8144
ps: invalid task or pid value: 998
This might have been some transition behavior of the pid hash design in
the kernel, because I've got two dumps based on 2.6.18 kernels that show
missing processes (this one had 3 out of 532, the other had 1 out of
146), but my new patched crash doesn't reveal any missing processes in
2.6.29 and newer dumps (I checked 4 dumps, with process counts ranging
from 362 to 926). Only my recent 2.6.18 dump was lucky enough to be
missing PID 1, with me being lucky enough to try crash's mount command,
or we'd still not know about it :-)
Yeah, I agree that it must be catching a kernel transition.
And it's probably not being seen in your 2.6.29-and-newer dumps because
2.6.24-and-later kernels use refresh_hlist_task_table_v3().
The patch is simple, but has lots of lines because I moved the indent.
The patch looks reasonable and safe. I'll run it against my stable of
sample dumpfiles to see if I can find one...
Anyway, nice catch Bob -- and thanks again for tracking down yet another
gnarly issue,
Dave