Hello Dave,
On Mon, 2011-09-19 at 11:05 -0400, Dave Anderson wrote:
> WARNING: multiple active tasks have called die and/or panic
> WARNING: multiple active tasks have called die
>
> In task.c we call "foreach bt -t" and check if we find "die" on the stack.
> When doing this on s390 with the "-t" option normally we find multiple
> die() calls for one single task:
>
> crash> foreach bt -t | grep "die at"
> [ 9ca7f7f0] die at 100f26
> [ 9ca7f8f0] die at 100f26
> [ 9ca7f9b8] die at 100f26
> [ 9ca7fa40] die at 100ee6
> [ 9ca7fa90] die at 100f26
>
> The current code then assumes that multiple tasks have called die().
>
> This patch fixes this problem by an additional check that allows multiple
> occurrences of the die() call on the stack (with bt -t) for one task.
> Strange -- has this always happened on s390's?
I don't think so, although I have seen that warning several times in the
past. But until now I have not had the time to look into this issue.
> And I wonder why there are multiple instances on the stack?
I think the reason is the -t option. It just finds multiple instances of
addresses that point to the die() function on the stack. I don't know
the exact reason, but the compiler can place whatever it wants on the
stack.
The current stack pointer is 9d7d3768 and the stack area is [9d7d3768 -
9d7d4000]:
crash> bt -t | grep die
[ 9d7d38b8] die at 100f26
[ 9d7d3988] die at 100f26
[ 9d7d3a40] die at 100ee6
[ 9d7d3a90] die at 100f26
crash> rd 9d7d38b8
9d7d38b8: 0000000000100f26 .......&
crash> rd 9d7d3988
9d7d3988: 0000000000100f26 .......&
crash> rd 9d7d3a40
9d7d3a40: 0000000000100ee6 ........
crash> rd 9d7d3a90
9d7d3a90: 0000000000100f26 .......&
> What does the actual backtrace look like?
The "normal" backtrace looks like the following:
crash> bt
PID: 10 TASK: 9d7bdba0 CPU: 0 COMMAND: "kworker/0:1"
LOWCORE INFO:
-psw : 0x0400100180000000 0x0000000000114630
-function : store_status at 114630
-prefix : 0x7ff08000
-cpu timer: 0x7fff15c0 0x0066b7fa
-clock cmp: 0x0066b7fa 0x00000000
-general registers:
0x0000000000000000 0x00000000001060a0
0x0400000180000000 0x000000009cb1ec00
0x000000000011d48c 0x0000000000000040
0x0000000000000000 0x00000000009c8c68
0x000000009cb1ec00 0x000000000011d4ac
0x000000009cb1ec00 0x000000000011dc18
0x000000009cb1ec00 0x00000000005b9870
0x0000000000111d08 0x000000009d7d3768
-access registers:
0x000003ff 0xfd3f76f0 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000
0x00000000 0x00000000 0x00000000 0x00000000
-control registers:
0x0000000004046e12 0x00000000009c2007
0x0000000000011140 0x0000000000000000
0x000000000000000a 0x0000000000011140
0x0000000051000000 0x00000000009c2007
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x00000000901bc1c7
0x00000000db000000 0x0000000000000000
-floating point registers 0,2,4,6:
0x4048000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
#0 [9d7d37a8] __machine_kexec at 11d4fa
#1 [9d7d37f0] smp_switch_to_ipl_cpu at 116ebe
#2 [9d7d3860] machine_kexec at 11d49c
#3 [9d7d3890] crash_kexec at 19ab26
#4 [9d7d3960] panic at 5af192
#5 [9d7d3a08] die at 100f26
#6 [9d7d3a70] do_no_context at 11e910
#7 [9d7d3aa8] do_protection_exception at 5b551a
#8 [9d7d3bc0] pgm_exit at 5b34b8
PSW: 0404100180000000 0000000000402d04 (sysrq_handle_crash+16)
GPRS: 0000000000010000 00000000009c8c74 0000000000000001 0000000000000000
00000000005af34e 00000000009c90e4 000000000091d3b0 0000000000a67960
070000000016b628 0000000000000001 0000000000959530 0000000000000063
00000000009596d0 0000000000606c60 000000000040309c 000000009d7d3d08
#0 [9d7d3d70] process_one_work at 166abe
#1 [9d7d3dd8] worker_thread at 1672da
#2 [9d7d3e50] kthread at 1705b6
#3 [9d7d3eb8] kernel_thread_starter at 5b2e3a
> In any case, I guess the patch makes sense, although I wonder why
> nobody else has ever reported this.
I assume that everybody has just ignored the warning...
> By any chance, given that this must be a zdump-type dumpfile (?), does
> the "dh_cpu_id" member in the header correlate to the panic cpu?
Not necessarily. We have code that switches to the original boot CPU in
case of panic. So the dumping CPU normally is not the CPU that called
panic().
> Or is there any other way that the panic'ing task can be ascertained from
> "S390D" dumpfiles such that get_dumpfile_panic_task() can do the job?
Hmmm, I don't think so. Probably the only way is to search for die or
panic on the stack. Perhaps we can do that without the -t option?
Michael