Hi Tao
1. What is the root cause, why cpu offline state will lead to
mapping
errors? You didn't make this clear in your commit message.
The back CPU context wrong order When the front CPU is offline.
exp:
CPU0, CPU1 is offline, CPU2 is online, CPU2's nt_prstatus_percpu from
corefile note prstatus[0], because i, j not equal.
2. Is this issue cpu arch specific? Since the example you gave is
only
arm64 specific, however the code change is generic for all archs. So
I'm a bit worried about whether it will make the result incorrect for
other archs. In addition, I didn't reproduce the mapping error issue
on my x86_64 machine, so I cannot verify the patch locally.
N/A
What command line you trigger will get the following outputs?
$
crash vmlinux vmcore -d 1
crash> help -r
Thanks,
Guanyou
Tao Liu <ltao(a)redhat.com> 于2024年11月21日周四 13:42写道:
> Hi Guanyou,
>
> Thanks for the fix. However I have a few questions for this patch:
>
> On Fri, Nov 1, 2024 at 10:16 PM Guanyou Chen <chenguanyou9338(a)gmail.com>
> wrote:
> >
> > Hi Lianbo, Tao
> >
> > When CPUs are in an offline state, it can lead to mapping errors.
>
1. What is the root cause, why cpu offline state will lead to
mapping
errors? You didn't make this clear in your commit message.
2. Is this issue cpu arch specific? Since the example you gave is
only
arm64 specific, however the code change is generic for all archs. So
I'm a bit worried about whether it will make the result incorrect for
other archs. In addition, I didn't reproduce the mapping error issue
on my x86_64 machine, so I cannot verify the patch locally.
>
> > We need to map them to the correct positions one by one.
> >
>
> 3. Please make the commit log tidy. I know you want to diff the
> outputs of "before" and "after", but we only need the info which
are
> highly related to this issue. e.g:
>
> > Before:
>
What command line you trigger will get the following outputs?
>
> > n_namesz: 5 ("CPU2")
> > n_descsz: 392
> > n_type: 1 (NT_PRSTATUS)
> > si.signo: 0 si.code: 0 si.errno: 0
> > cursig: 0 sigpend: 0 sighold: 0
> > pid: 3 ppid: 0 pgrp: 0 sid:0
> > utime: 0.000000 stime: 0.000000
> > cutime: 0.000000 cstime: 0.000000
>
> I'm not sure if the above info is related to this issue?
>
> > X0: ffffffc000fc8818 X1: 0000000000000000 X2:
> ffffffc000fc84c8
> > X3: 0000000000000000 X4: ffffffc0405e37bf X5:
> ffffffc00a07372f
> > X6: 322e34323320205b X7: 545b5d3539383334 X8:
> ffffffc000fc2f0c
> > X9: 89fece0a9ef8cb00 X10: c0000001001f75f4 X11:
> 00000001001f75f4
> > X12: 0000000000000003 X13: 00000000000005f4 X14:
> ffffffc009eb1210
> > X15: 0000000000000004 X16: 000000002a4cec24 X17:
> 000000002a4cec24
> > X18: ffffffc009e7d140 X19: ffffffc00a04c670 X20:
> 0000000000000000
> > X21: 0000000000000000 X22: ffffff8027f22280 X23:
> 0000000000000009
> > X24: 0000000000000007 X25: ffffffc009f839c0 X26:
> ffffffc0090f87f8
> > X27: 0000000000000000 X28: ffffff80454f3840 X29:
> ffffffc0405e3b60
> > LR: ffffffc0080e57fc SP: ffffffc0405e3b60 PC:
> ffffffc000fc2f84
>
> We don't need a list of all regs, a few lines of which can indicate
> the output mismatch for "Before" and "After" is enough.
>
> >
> > CPU 0: [OFFLINE]
> > CPU 1: [OFFLINE]
> > CPU 2:
> > X0: 0000000000000000 X1: 0000000000000000 X2: 0000000000000000
> > X3: 000000000003fcbc X4: 0000000000000001 X5: 0000000000000000
> > X6: 0000000000000000 X7: 0000000000000000 X8: 00000000ffffffff
> > X9: ffffffc009e6ae48 X10: ffffffc009e6ae20 X11: 0000000000000000
> > X12: 0000000000000002 X13: 0000000000000004 X14: 0000000000000000
> > X15: 0000000000004000 X16: 00000000f90f05f6 X17: 00000000f90f05f6
> > X18: 0000000000000000 X19: 0000000000000002 X20: ffffffc009e3b008
> > X21: ffffffc00a01d020 X22: ffffffc009f798f0 X23: 0000000060001000
> > X24: 0000000000000000 X25: 0000000000000000 X26: 0000000000000000
> > X27: 0000000000000000 X28: ffffff8111eecb00 X29: ffffffc008003f50
> > LR: ffffffc00802df88 SP: ffffffc008003f40 PC: ffffffc00802df94
> > PSTATE: 024003c5 FPVALID: 00000000
> >
> > After:
> > CPU 2:
> > X0: ffffffc000fc8818 X1: 0000000000000000 X2: ffffffc000fc84c8
> > X3: 0000000000000000 X4: ffffffc0405e37bf X5: ffffffc00a07372f
> > X6: 322e34323320205b X7: 545b5d3539383334 X8: ffffffc000fc2f0c
> > X9: 89fece0a9ef8cb00 X10: c0000001001f75f4 X11: 00000001001f75f4
> > X12: 0000000000000003 X13: 00000000000005f4 X14: ffffffc009eb1210
> > X15: 0000000000000004 X16: 000000002a4cec24 X17: 000000002a4cec24
> > X18: ffffffc009e7d140 X19: ffffffc00a04c670 X20: 0000000000000000
> > X21: 0000000000000000 X22: ffffff8027f22280 X23: 0000000000000009
> > X24: 0000000000000007 X25: ffffffc009f839c0 X26: ffffffc0090f87f8
> > X27: 0000000000000000 X28: ffffff80454f3840 X29: ffffffc0405e3b60
>
> > LR: ffffffc0080e57fc SP: ffffffc0405e3b60 PC: ffffffc000fc2f84
> > PSTATE: 600000c5 FPVALID: 00000000
> >
> > crash> bt
> > PID: 15959 TASK: ffffff80454f3840 CPU: 2 COMMAND:
"AnrConsumer"
> > [ffffffc0405e3b60] ipanic at ffffffc000fc2f80 [mrdump]
> > [ffffffc0405e3b70] atomic_notifier_call_chain at ffffffc0080e57f8
> > [ffffffc0405e3c30] panic at ffffffc008f734d0
> > [ffffffc0405e3c80] sysrq_handle_crash at ffffffc0087f3c18
> > [ffffffc0405e3c90] __handle_sysrq at ffffffc0087f3798
> > [ffffffc0405e3ce0] write_sysrq_trigger at ffffffc0087f49c0
> > [ffffffc0405e3d00] proc_reg_write at ffffffc00842e4b8
> > [ffffffc0405e3d80] vfs_write at ffffffc008381eb4
> > [ffffffc0405e3dd0] ksys_write at ffffffc008382200
> > [ffffffc0405e3e10] __arm64_sys_write at ffffffc00838228c
> > [ffffffc0405e3e20] invoke_syscall at ffffffc00802efe0
> > [ffffffc0405e3e40] el0_svc_common at ffffffc00802eef4
> > [ffffffc0405e3e70] do_el0_svc at ffffffc00802ede8
> > [ffffffc0405e3e80] el0_svc at ffffffc008f7a7d0
> > [ffffffc0405e3ea0] el0t_64_sync_handler at ffffffc008f7a758
> > [ffffffc0405e3fe0] el0t_64_sync at ffffffc00801157c
>
> The stacktrace is meaningless for this patch, we can cut this off.
>
> > PC: 00000077c798ca28 LR: 00000077a82e19f4 SP: 000000761c517af0
> > X29: 000000761c517b00 X28: 000000761c517db8 X27: 000000761c517c90
> > X26: 000000761c517c98 X25: 000000761c517bf9 X24: 000000761c519000
> > X23: 000000761c517be1 X22: 0000000000000001 X21: 00000000000003e3
> > X20: 000000761c517c11 X19: 000000761c517bf8 X18: 0000007568224000
> > X17: 00000077c798ca20 X16: 00000077c79b2ae0 X15: b4000077202cc480
> > X14: 0000000000000000 X13: 000000761c517a70 X12: ffffff80ffffffd0
> > X11: 000000761c517a40 X10: 0000000000000001 X9: 0000000000000000
> > X8: 0000000000000040 X7: 7f7f7f7f7f7f7f7f X6: 0000000000000010
> > X5: 000000761c517c0c X4: ffffffffffffffff X3: ffffffffffffffff
> > X2: 0000000000000001 X1: 000000761c517c11 X0: 00000000000003e3
> > ORIG_X0: 00000000000003e3 SYSCALLNO: 40 PSTATE: 00001000
> >
> ditto.
>
> Maybe a short summary after the output paste, telling us which info
> should reviewers pay attention to.
>
> Thanks,
> Tao Liu
>
> > Signed-off-by: Guanyou.Chen <chenguanyou(a)xiaomi.com>
> > ---
> > netdump.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/netdump.c b/netdump.c
> > index b4e2a5c..8ea5159 100644
> > --- a/netdump.c
> > +++ b/netdump.c
> > @@ -75,7 +75,7 @@ void
> > map_cpus_to_prstatus(void)
> > {
> > void **nt_ptr;
> > - int online, i, j, nrcpus;
> > + int online, i, nrcpus;
> > size_t size;
> >
> > if (pc->flags2 & QEMU_MEM_DUMP_ELF) /* notes exist for all cpus */
> > @@ -100,9 +100,9 @@ map_cpus_to_prstatus(void)
> > */
> > nrcpus = (kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS);
> >
> > - for (i = 0, j = 0; i < nrcpus; i++) {
> > + for (i = 0; i < nrcpus; i++) {
> > if (in_cpu_map(ONLINE_MAP, i) &&
> machdep->is_cpu_prstatus_valid(i)) {
> > - nd->nt_prstatus_percpu[i] = nt_ptr[j++];
> > + nd->nt_prstatus_percpu[i] = nt_ptr[i];
> > nd->num_prstatus_notes =
> > MAX(nd->num_prstatus_notes, i+1);
> > }
> > --
> > 2.34.1
> >
> > Guanyou.
> > Thanks.
>
>