On 2012-08-25 02:17, Dave Anderson wrote:
> ----- Original Message -----
>
>
> ----- Original Message -----
>> Hello Dave,
>>
>> In runq command, when dumping cfs and rt runqueues,
>> it seems that we get the wrong nr_running values of rq
>> and cfs_rq.
>>
>> Please refer to the attached patch.
>>
>> Thanks
>> Zhang Yanfei
>
> Hello Zhang,
>
> I understand what you are trying to accomplish with this patch, but
> none of my test dumpfiles can actually verify it because there is no
> difference with or without your patch. What failure mode did you see
> in your testing? I presume that it just showed "[no tasks queued]"
> for the RT runqueue when there were actually tasks queued there?
>
> The reason I ask is that I'm thinking that a better solution would
> be to simplify dump_CFS_runqueues() by *not* accessing and using
> rq_nr_running, cfs_rq_nr_running or cfs_rq_h_nr_running.
>
> Those counters are only read to determine the "active" argument to
> pass to dump_RT_prio_array(), which returns immediately if it is
> FALSE. However, if we get rid of the "active" argument and simply
> allow dump_RT_prio_array() to always check its queues every time,
> it still works just fine.
>
> For example, I tested my set of sample dumpfiles with this patch:
>
> diff -u -r1.205 task.c
> --- task.c 12 Jul 2012 20:04:00 -0000 1.205
> +++ task.c 22 Aug 2012 15:33:32 -0000
> @@ -7636,7 +7636,7 @@
>  			OFFSET(cfs_rq_tasks_timeline));
>  	}
>  
> -	dump_RT_prio_array(nr_running != cfs_rq_nr_running,
> +	dump_RT_prio_array(TRUE,
>  		runq + OFFSET(rq_rt) + OFFSET(rt_rq_active),
>  		&runqbuf[OFFSET(rq_rt) + OFFSET(rt_rq_active)]);
>
> and the output is identical to testing with, and without, your patch.
>
> So the question is whether dump_CFS_runqueues() should be needlessly
> complicated with all of the "nr_running" references?
>
> In fact, it also seems possible that a crash could happen at a point in
> the scheduler code where those counters are not
> valid/current/trustworthy.
>
> So unless you can convince me otherwise, I'd prefer to just remove
> the "nr_running" business completely.
> Hello Zhang,
>
> Here's the patch I've got queued, which resolves the bug you encountered
> by simplifying things:
OK, I see.

Based on this patch, I have made a new patch to fix a problem with dumping the
rt runqueues: dump_RT_prio_array() currently does not handle the rt group
scheduler, so rt tasks that are queued on a task group's own rt_rq are never
displayed.
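For reference, the structures involved look roughly like the following
(simplified from a 2.6.32-era kernel built with CONFIG_RT_GROUP_SCHED; the
exact layout varies by kernel version). Every entry on an rt_prio_array queue
is a sched_rt_entity: for an ordinary task it is embedded in the task_struct,
but for a group its my_q pointer leads to the group's own rt_rq, and that
child rt_rq is what dump_RT_prio_array() currently never descends into:

/* Simplified sketch, not the exact definitions of any one kernel version. */
struct rt_prio_array {
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO + 1);  /* one bit per rt priority */
	struct list_head queue[MAX_RT_PRIO];      /* per-priority entity lists */
};

struct rt_rq {
	struct rt_prio_array active;              /* what "runq" walks */
	/* ... */
};

struct sched_rt_entity {
	struct list_head run_list;    /* linked into rt_prio_array.queue[prio] */
	/* ... */
#ifdef CONFIG_RT_GROUP_SCHED
	struct sched_rt_entity *parent;
	struct rt_rq *rt_rq;          /* the rt_rq this entity is queued on */
	struct rt_rq *my_q;           /* NULL for a task; for a group, the
	                                 group's own rt_rq whose active
	                                 rt_prio_array also holds queued tasks */
#endif
};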
In my test, I put several rt tasks into one cpu cgroup, as shown below
(a sketch of the rtloopN helpers follows the commands):
mkdir /cgroup/cpu/test1
echo 850000 > /cgroup/cpu/test1/cpu.rt_runtime_us
./rtloop1 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop1 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop1 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop98 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop45 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop99 &
echo $! > /cgroup/cpu/test1/tasks
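The rtloopN helpers simply spin at realtime priority N; a minimal sketch of
such a helper is shown below. This is only an illustration of the kind of
program used: the priority here is taken from argv[1], and SCHED_FIFO is
assumed (SCHED_RR would demonstrate the same thing).

/* rtloop sketch: busy-loop at the given realtime priority.
 * Illustration only -- the actual rtloopN binaries may hard-code N. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct sched_param sp;

	sp.sched_priority = (argc > 1) ? atoi(argv[1]) : 1;

	if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0) {
		perror("sched_setscheduler");
		return 1;
	}

	for (;;)
		;	/* stay runnable so the task remains on the rt runqueue */
}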
Using crash to analyse the vmcore:
crash> runq
CPU 0 RUNQUEUE: ffff880028216680
CURRENT: PID: 5125 TASK: ffff88010799d540 COMMAND: "sh"
RT PRIO_ARRAY: ffff880028216808
[ 0] PID: 5136 TASK: ffff8801153cc040 COMMAND: "rtloop99"
PID: 6 TASK: ffff88013d7c6080 COMMAND: "watchdog/0"
PID: 3 TASK: ffff88013d7ba040 COMMAND: "migration/0"
[ 1] PID: 5134 TASK: ffff8801153cd500 COMMAND: "rtloop98"
PID: 5135 TASK: ffff8801153ccaa0 COMMAND: "rtloop98"
CFS RB_ROOT: ffff880028216718
[120] PID: 5109 TASK: ffff880037923500 COMMAND: "sh"
[120] PID: 5107 TASK: ffff88006eeccaa0 COMMAND: "sh"
[120] PID: 5123 TASK: ffff880107a4caa0 COMMAND: "sh"
CPU 1 RUNQUEUE: ffff880028296680
CURRENT: PID: 5086 TASK: ffff88006eecc040 COMMAND: "bash"
RT PRIO_ARRAY: ffff880028296808
[ 0] PID: 5137 TASK: ffff880107b35540 COMMAND: "rtloop99"
PID: 10 TASK: ffff88013cc2cae0 COMMAND: "watchdog/1"
PID: 2852 TASK: ffff88013bd5aae0 COMMAND: "rtkit-daemon"
[ 54] CFS RB_ROOT: ffff880028296718
[120] PID: 5115 TASK: ffff8801152b1500 COMMAND: "sh"
[120] PID: 5113 TASK: ffff880139530080 COMMAND: "sh"
[120] PID: 5111 TASK: ffff88011bd86080 COMMAND: "sh"
[120] PID: 5121 TASK: ffff880115a9e080 COMMAND: "sh"
[120] PID: 5117 TASK: ffff8801152b0040 COMMAND: "sh"
[120] PID: 5119 TASK: ffff880115a9eae0 COMMAND: "sh"
We can see that the output is incorrect: the rt tasks that were moved into the
test1 group do not show up at all, and on CPU 1 a bare "[ 54]" entry is printed
with nothing under it, because the queued entity at that priority is the
group's sched_rt_entity rather than a task.
After applying the attached patch, the tasks queued inside the group are shown
under a "CHILD RT PRIO_ARRAY" heading:
crash> runq
CPU 0 RUNQUEUE: ffff880028216680
CURRENT: PID: 5125 TASK: ffff88010799d540 COMMAND: "sh"
RT PRIO_ARRAY: ffff880028216808
[ 0] PID: 5136 TASK: ffff8801153cc040 COMMAND: "rtloop99"
CHILD RT PRIO_ARRAY: ffff88013b050000
[ 0] PID: 5133 TASK: ffff88010799c080 COMMAND: "rtloop99"
[ 1] PID: 5131 TASK: ffff880037922aa0 COMMAND: "rtloop98"
[ 98] PID: 5128 TASK: ffff88011bd87540 COMMAND: "rtloop1"
PID: 5130 TASK: ffff8801396e7500 COMMAND: "rtloop1"
PID: 5129 TASK: ffff88011bf5a080 COMMAND: "rtloop1"
PID: 6 TASK: ffff88013d7c6080 COMMAND: "watchdog/0"
PID: 3 TASK: ffff88013d7ba040 COMMAND: "migration/0"
[ 1] PID: 5134 TASK: ffff8801153cd500 COMMAND: "rtloop98"
PID: 5135 TASK: ffff8801153ccaa0 COMMAND: "rtloop98"
CFS RB_ROOT: ffff880028216718
[120] PID: 5109 TASK: ffff880037923500 COMMAND: "sh"
[120] PID: 5107 TASK: ffff88006eeccaa0 COMMAND: "sh"
[120] PID: 5123 TASK: ffff880107a4caa0 COMMAND: "sh"
CPU 1 RUNQUEUE: ffff880028296680
CURRENT: PID: 5086 TASK: ffff88006eecc040 COMMAND: "bash"
RT PRIO_ARRAY: ffff880028296808
[ 0] PID: 5137 TASK: ffff880107b35540 COMMAND: "rtloop99"
PID: 10 TASK: ffff88013cc2cae0 COMMAND: "watchdog/1"
PID: 2852 TASK: ffff88013bd5aae0 COMMAND: "rtkit-daemon"
[ 54] CHILD RT PRIO_ARRAY: ffff880138978000
[ 54] PID: 5132 TASK: ffff88006eecd500 COMMAND: "rtloop45"
CFS RB_ROOT: ffff880028296718
[120] PID: 5115 TASK: ffff8801152b1500 COMMAND: "sh"
[120] PID: 5113 TASK: ffff880139530080 COMMAND: "sh"
[120] PID: 5111 TASK: ffff88011bd86080 COMMAND: "sh"
[120] PID: 5121 TASK: ffff880115a9e080 COMMAND: "sh"
[120] PID: 5117 TASK: ffff8801152b0040 COMMAND: "sh"
[120] PID: 5119 TASK: ffff880115a9eae0 COMMAND: "sh"
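The approach can be summarized by the recursion sketched below, written as
plain kernel-style C just to show the idea (this is not the patch itself,
which of course reads the structures through readmem() and the offset table
in task.c rather than dereferencing kernel pointers): whenever an entity
queued on an rt_prio_array has a non-NULL my_q, it represents a group, so its
rt_rq is printed as a "CHILD RT PRIO_ARRAY" and walked the same way; otherwise
the entity is task_struct.rt and the task is printed as before.

/* Sketch of the traversal only -- not the attached crash patch itself.
 * In the real output the "[prio]" label is printed once per queue. */
static void dump_rt_prio_array_sketch(struct rt_prio_array *array, int depth)
{
	int prio;
	struct sched_rt_entity *se;
	struct task_struct *p;

	for (prio = 0; prio < MAX_RT_PRIO; prio++) {
		list_for_each_entry(se, &array->queue[prio], run_list) {
			if (se->my_q) {
				/* group entity: descend into the group's
				   own rt_rq and dump it as a child array */
				printf("%*sCHILD RT PRIO_ARRAY: %p\n",
				       depth * 3, "", &se->my_q->active);
				dump_rt_prio_array_sketch(&se->my_q->active,
							  depth + 1);
			} else {
				/* task entity: embedded in the task as
				   task_struct.rt */
				p = container_of(se, struct task_struct, rt);
				printf("%*s[%3d] PID: %d COMMAND: \"%s\"\n",
				       depth * 3, "", prio, p->pid, p->comm);
			}
		}
	}
}

In the test above this is why the child array on CPU 1 appears under "[ 54]":
the highest-priority task queued in the group there is rtloop45, so the
group's entity sits at priority index 99 - 45 = 54.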
Is this kind of output OK for the rt runqueues, or do you have any suggestions?
Thanks
Zhang Yanfei