-----Original Message-----
> Last week I ran into an incomplete vmcore, which was generated because
> the system running the kdump kernel was somehow rebooted in the middle
> of copying the vmcore.
>
> Unfortunately, in the incomplete vmcore, most of the tasks could not
> be detected via the PID hash table because the objects relevant to the
> PID hash, including the ptes needed to refer to those objects, were lost.
>
> Although I successfully found many task_struct objects through other
> data structures, such as the circular list of task_struct::tasks and
> the run queues, crash sub-commands fail with the following message
> if a given object is not contained in the task table:
>
> crash> bt 0xffffffff00000000
> bt: invalid task or pid value: 0xffffffff00000000
>
> To address this issue, I made a patch that adds a command-line option
> for passing a list of task_struct addresses, so that crash tries to
> detect them as entries in the task table.
>
> I wrote this in a very short time, and there may be a better interface
> than a command-line option.
>
> Tested on top of crash-7.2.5.
Yeah, what bothers me about this patch is that even though it worked for your
particular half-baked vmcore, it may never be of any help to anybody else
in the future.
It's similar in nature to patches that have been posted that address a particular
unique kernel bug that was seen in one vmcore, but it would be highly unlikely
that the circumstances would ever be seen again.
I'm posting this patch because I think it could be useful for everyone...
It was unfortunate that an incomplete vmcore was generated and sent to us, but
there are cases where engineers have to investigate issues based on an
incomplete vmcore.
I also think there would be other cases where restoring task_struct objects
could fail due to pure software issues, for example memory corruption bugs,
and I think it is natural that crash doesn't behave as expected when kernel
data structures are in an abnormal state.
I think options such as --minimal and --no_kmem_cache exist to deal with
such cases, and this feature is similar in that sense.
By the way, I believe I have seen vmcores in the past where some error
messages were output during "(gathering task table data)", and I guess
some tasks were missing there, but this was the first case where I actually
needed to try to restore them.
But in this case, it's even more unlikely given that it's dealing with
an incomplete vmcore. You were lucky that you were able to even
bring up a crash session at all -- and then were able to generate
a task list after that.
It was incomplete, but about 98% of it was there. Detection via the PID
hash was affected by the loss of the remaining 2%.
Following the task_struct.tasks list doesn't gather all of the
tasks in a task group, so it doesn't create a fully populated task
list, correct?
Yes, I needed to repeat the walk from each successfully detected task_struct
object in turn until all the target tasks were covered, and as you guessed,
there's no guarantee that I found all the task_struct objects, which is
why I said 'many of'.
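
The repeated walk described above can be sketched roughly as follows. This is only an illustration, not crash or kernel code: struct mock_task, walk_from_seed() and the NULL-means-unreadable convention are all assumptions made for the example, with a single forward pointer standing in for task_struct.tasks.next.

```c
#include <stddef.h>

/*
 * Simplified stand-in for task_struct (hypothetical, for illustration).
 * "next" plays the role of task_struct.tasks.next; NULL models a link
 * that cannot be followed because its memory is missing from the dump.
 */
struct mock_task {
    int pid;
    struct mock_task *next;
};

/* Return 1 if t is already recorded in found[0..n). */
static int already_found(struct mock_task *const *found, size_t n,
                         const struct mock_task *t)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (found[i] == t)
            return 1;
    return 0;
}

/*
 * Follow the list forward from seed, appending each new task to found[].
 * The walk stops when the list wraps around or reaches a task that an
 * earlier walk already recorded, when a link is unreadable, or when the
 * table is full. Returns the updated number of entries in found[].
 */
size_t walk_from_seed(struct mock_task *seed, struct mock_task **found,
                      size_t n, size_t max)
{
    struct mock_task *t = seed;

    while (t != NULL && n < max && !already_found(found, n, t)) {
        found[n++] = t;
        t = t->next;
    }
    return n;
}
```

Calling walk_from_seed() again with a different seed (for example a task taken from a run queue) extends the same found[] array, which is why several passes may be needed and why full coverage is still not guaranteed.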
Plus it doesn't make sense to add it unless it's documented *how* to
create the task list to begin with.
How about describing the use case in the help message and/or the manual page?
  -T file
    Make crash treat the task_struct objects listed in file as if they
    were in the task table. This is useful when the tasks you are
    interested in are missing from the task table. You may find the
    task_struct objects of interest through various kernel data
    structures:

    From task_struct::tasks:

      crash> list task_struct.tasks -s task_struct.pid,comm -h ffff88003daa8000
      ffff88003daa8000
        pid = 1
        comm = "systemd\000\060\000\000\000\000\000\000"
      ffff88003daa8fd0
        pid = 2
        comm = "kthreadd\000\000\000\000\000\000\000"
      ffff88003daa9fa0
        pid = 3
        comm = "ksoftirqd/0\000\000\000\000"
      ...<snip>...

    From runqueue:

      crash> runq
      CPU 0 RUNQUEUE: ffff88003fc16cc0
        CURRENT: PID: 2188 TASK: ffff8800360f8000 COMMAND: "foobar.sh"
        RT PRIO_ARRAY: ffff88003fc16e50
           [no tasks queued]
        CFS RB_ROOT: ffff88003fc16d68
           [no tasks queued]
      CPU 1 RUNQUEUE: ffff88003fd16cc0
        CURRENT: PID: 1 TASK: ffff88003daa8000 COMMAND: "systemd"
        RT PRIO_ARRAY: ffff88003fd16e50
           [no tasks queued]
        CFS RB_ROOT: ffff88003fd16d68
           [120] PID: 19054 TASK: ffff88000b684f10 COMMAND: "kworker/1:0"
           [120] PID: 3863 TASK: ffff88003bd02f70 COMMAND: "emacs"
This might be too long for the options section, though.
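
For reference, reading such a file could look roughly like the sketch below. The file format is not spelled out in this thread, so one hexadecimal task_struct address per line (with blank lines and '#' comments skipped) is an assumption, and parse_task_line() and read_task_list() are hypothetical names rather than functions from the patch or from crash.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one line of a task list file.
 * Assumed format: one hexadecimal address per line, optional "0x" prefix.
 * Returns 1 and stores the address on success, 0 for a blank or comment
 * line, and -1 for a malformed line.
 */
int parse_task_line(const char *line, unsigned long long *addr)
{
    char *end;
    unsigned long long value;

    assert(line != NULL && addr != NULL);

    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '\r' || *line == '#')
        return 0;
    /* strtoull() would silently accept a sign, so reject it here. */
    if (*line == '-' || *line == '+')
        return -1;

    errno = 0;
    value = strtoull(line, &end, 16);
    if (end == line || errno == ERANGE)
        return -1;

    /* Only trailing whitespace may follow the address. */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;

    *addr = value;
    return 1;
}

/*
 * Read up to max addresses from fp into addrs[]. Lines are assumed to be
 * shorter than the local buffer. Returns the number of addresses read,
 * or -1 if a malformed line is found.
 */
long read_task_list(FILE *fp, unsigned long long *addrs, size_t max)
{
    char buf[128];
    size_t n = 0;

    while (n < max && fgets(buf, sizeof(buf), fp) != NULL) {
        int rc = parse_task_line(buf, &addrs[n]);

        if (rc < 0)
            return -1;
        n += (size_t)rc;
    }
    return (long)n;
}
```

The addresses printed by the list and runq examples above could be pasted into such a file one per line. How each address would then be validated and entered into the task table is left to the actual patch.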
I don't know, let me think about this...
I don't think the current design is the best one.
For example, it might be better to be able to update the task table at runtime
through some crash sub-command. Looking at the source code, it appears that
the task table is refreshed at command execution when needed, so it may not
be so difficult?
void
exec_command(void)
{
...<snip>...
        if ((ct = get_command_table_entry(args[0]))) {
                if (ct->flags & REFRESH_TASK_TABLE) {
                        if (XEN_HYPER_MODE()) {
#ifdef XEN_HYPERVISOR_ARCH
                                xen_hyper_refresh_domain_context_space();
                                xen_hyper_refresh_vcpu_context_space();
#else
                                error(FATAL, XEN_HYPERVISOR_NOT_SUPPORTED);
#endif
                        } else if (!(pc->flags & MINIMAL_MODE)) {
                                tt->refresh_task_table();
                                sort_context_array();
                                sort_tgid_array();
                        }
                }
Thanks.
HATAYAMA, Daisuke