CFS runqueue list handling improvements
by David Mair
I was working with a crash dump last week where the CFS runqueue for one
CPU contained a loop. It was caused by a red-black tree node that was the
left child of one node while at the same time being the right child of a
different node. The doubly-linked node stores only one parent pointer (to
one of the two nodes that have it as a child). Call the node with two
links to it X, call the node stored as X's parent Y, and call the other
node that links to X Z. Here Z is the parent of Y, but X is the right
child of Z (and the left child of Y). So you reach Z, walk all of its left
branch, then go right to reach X. Going up, you arrive at Z from the left
rather than at Y from the right, and walk all of Z's right branch. When
you then go up to Y, the last node processed is neither its left nor its
right child, so you walk the whole left subtree of Y again, and so on.
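To make that shape concrete, here is a minimal standalone sketch (toy
types and toy rb_first()/rb_next()-style helpers, nothing taken from the
kernel or from crash) that reproduces the endless walk:

#include <stdio.h>

struct node {
	struct node *parent, *left, *right;
	const char *name;
};

/* leftmost node of a subtree, in the style of rb_first() */
static struct node *first(struct node *n)
{
	while (n->left)
		n = n->left;
	return n;
}

/* in-order successor via parent pointers, in the style of rb_next() */
static struct node *next(struct node *n)
{
	struct node *parent;

	if (n->right)
		return first(n->right);
	while ((parent = n->parent) && n == parent->right)
		n = parent;
	return parent;
}

int main(void)
{
	struct node X = { .name = "X" }, Y = { .name = "Y" }, Z = { .name = "Z" };

	Z.left = &Y;  Y.parent = &Z;	/* Y is Z's left child */
	Y.left = &X;  X.parent = &Y;	/* X is Y's left child; X's stored parent is Y */
	Z.right = &X;			/* the corruption: X is also Z's right child */

	int i = 0;
	for (struct node *n = first(&Z); n && i < 12; n = next(n), i++)
		printf("%s ", n->name);	/* prints X Y Z X Y Z ... forever without the cap */
	printf("\n");
	return 0;
}

Once the walk follows Z's stray right link back to X, the parent pointers
can only lead it around the same nodes again, so it never reaches the end
of the tree.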
With that dump, the runq command (task.c) in crash never bails out of
dump_CFS_runqueues(), because the for loop that goes from rb_first()
through successive rb_next() calls has no way of detecting the problem.
You can use runq -d to get only the task list (which was not corrupted),
but the plain runq command becomes very awkward to use while diagnosing
the actual problem.
It occurs to me that this could be improved with one or more additional
exit conditions for that loop, each producing a warning message:
* The number of iterations exceeds cfs_rq_nr_running (perhaps allowing it
to be exceeded by some small margin to have a chance of seeing the nature
of the problem, and perhaps with a runq command-line switch that allows
the problem to be ignored); see the sketch after this list.
* At the node reached on the cfs_rq_nr_running'th iteration, note the node
address and exit with a warning if that node is ever displayed again, thus
catching any loop that starts within the first cfs_rq_nr_running nodes.
* A more elaborate loop detection that also handles cases where
cfs_rq_nr_running is significantly lower than the number of nodes actually
present in the valid part of the tree.
* I suppose that if the last node processed is still available when going
up, it should always be the left or right child of the parent being
reached; if it is not, the tree is broken.
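As a rough sketch of the first option (an abstract successor callback and
an arbitrary margin, not the actual code in task.c):

#include <stdio.h>

/*
 * Walk at most nr_running + margin nodes.  "succ" stands in for an
 * rb_next()-style successor and "show" for the per-task display.
 * Returns the node count, or -1 if the cap is hit, which suggests
 * either a loop or a bogus nr_running value.
 */
long bounded_walk(void *head, void *(*succ)(void *),
		  void (*show)(void *), unsigned long nr_running)
{
	const unsigned long margin = 16;	/* arbitrary slack */
	unsigned long seen = 0;
	void *n;

	for (n = head; n; n = succ(n)) {
		if (++seen > nr_running + margin) {
			fprintf(stderr,
			    "WARNING: walked more nodes than cfs_rq.nr_running "
			    "(%lu); possible loop\n", nr_running);
			return -1;
		}
		show(n);	/* display the task for this node */
	}
	return (long)seen;
}

A command-line switch could raise the margin (or skip the cap) when
someone actually wants to page through the corrupt part of the tree.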
I understand that cfs_rq_nr_running could itself contain the "corruption",
so assuming it indicates the correct number of times to call rb_next()
isn't safe either; my preference is therefore for some form of loop
detection (for example, something along the lines of the sketch below).
Before I work on a patch, are there any opinions on how the runq command
should behave in the case of a corrupt CFS runqueue, e.g. one that
contains a loop?
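To be concrete about what I mean by loop detection, here is one
possibility sketched in isolation: Floyd's two-pointer walk over the
successor chain. It is written against an abstract successor callback
rather than the real tree walk in task.c, so treat it only as an
illustration of the idea:

#include <stddef.h>

/*
 * Cycle check over any "give me the next node" relation, e.g. an
 * rb_next()-style in-order successor.  No counter is trusted, only
 * pointer equality between a slow and a fast walker.
 */
int successor_chain_has_loop(void *head, void *(*succ)(void *))
{
	void *slow = head;
	void *fast = head;

	while (fast && (fast = succ(fast)) != NULL) {
		fast = succ(fast);	/* fast advances two steps per round */
		slow = succ(slow);	/* slow advances one step per round */
		if (fast && fast == slow)
			return 1;	/* the chain loops back on itself */
	}
	return 0;			/* reached the end: no loop */
}

The cost is roughly one extra traversal of the tree, which seems tolerable
for a diagnostic command.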
--
David Mair
SUSE Linux
[PATCH 1/1] CFS runqueue loop detection
by David Mair
Here is a patch against crash v6.0.2 that adds per-CPU duplicate-node
detection to the CFS runqueue display in dump_CFS_runqueues() for the
runq command.
It fixes the failure to bail out of the endless looping display that I
get with a crash dump whose CFS runqueue is corrupted and contains a loop.
Signed-off-by: David Mair <dmair(a)suse.com>
---
task.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/task.c b/task.c
index 433a043..0333fe8 100755
--- a/task.c
+++ b/task.c
@@ -7050,7 +7050,12 @@ dump_tasks_in_cfs_rq(ulong cfs_rq, ulong skip)
 				OFFSET(sched_entity_run_node));
 		if (!tc)
 			continue;
-		dump_task_runq_entry(tc);
+		if (hq_enter((ulong)tc)) {
+			dump_task_runq_entry(tc);
+		} else {
+			error(WARNING, "Duplicate CFS runqueue node, task %lx, probable loop\n", tc->task);
+			return total;
+		}
 		total++;
 	}
@@ -7217,10 +7222,12 @@ dump_CFS_runqueues(void)
 		fprintf(fp, " CFS RB_ROOT: %lx\n", (ulong)root);
 		tot = 0;
+		hq_open();
 		if (curr_cfs_rq)
 			tot += dump_tasks_in_cfs_rq(curr_cfs_rq, 0);
 		if (cfs_rq != curr_cfs_rq)
 			tot += dump_tasks_in_cfs_rq(cfs_rq, curr_cfs_rq);
+		hq_close();
 		if (!tot) {
 			INDENT(5);
 			fprintf(fp, "[no tasks queued]\n");
Is there any support for userspace stack dump for crash utility?
by Lei Wen
Hi,
I'd like to ask for help here: has anyone tried to make the crash utility
support userspace stack dumping, or is that already supported?
A userspace crash certainly does not cause a kernel panic by itself, but
to make our product more robust we force a kernel panic and trigger kdump
on every userspace crash.
So support for dumping the user stack is important to us.
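The exact helper we use is not important here, but for context this is a
rough sketch of the kind of setup involved (hypothetical helper name, a
core_pattern pipe handler that fires sysrq-c, assuming kdump is already
loaded):

#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical core-dump pipe helper, registered via
 * /proc/sys/kernel/core_pattern (e.g. "|/usr/local/sbin/panic_on_core %p"):
 * when a userspace process dumps core, force a kernel crash so that
 * kdump captures the whole system state at the moment of the fault.
 */
int main(int argc, char **argv)
{
	FILE *f;

	fprintf(stderr, "userspace crash in pid %s, forcing kdump\n",
	    argc > 1 ? argv[1] : "?");

	f = fopen("/proc/sysrq-trigger", "w");
	if (!f)
		return EXIT_FAILURE;
	fputc('c', f);
	fclose(f);		/* the flushed 'c' crashes the kernel -> kdump */
	return EXIT_SUCCESS;	/* normally not reached */
}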
Does anyone have ideas about this feature?
If we could get a stack dump like the one below, running from userspace
into kernel space with parameters shown, that would be perfect.
Example:
-010|do_vfs_ioctl(filp = 0xCE575500, ?, ?, arg = 1181448720)
-011|sys_ioctl(fd = 27, cmd = 30000, arg = 1181448720)
-012|ret_fast_syscall(asm)
-->|exception
-013|__ioctl(asm)
-014|ioctl(?, request = 0)
Thanks,
Lei