Hi  Tao

> 1. What is the root cause, why cpu offline state will lead to mapping
> errors? You didn't make this clear in your commit message.

The back CPU context  wrong order When the front CPU is offline.
exp:
CPU0, CPU1 is offline,  CPU2 is online,  CPU2's nt_prstatus_percpu from corefile note prstatus[0], because  i, j not equal.

> 2. Is this issue cpu arch specific? Since the example you gave is only
> arm64 specific, however the code change is generic for all archs. So
> I'm a bit worried about whether it will make the result incorrect for
> other archs. In addition, I didn't reproduce the mapping error issue
> on my x86_64 machine, so I cannot verify the patch locally.

N/A

> What command line you trigger will get the following outputs?
$ crash vmlinux vmcore -d 1
crash> help -r
 
Thanks,
Guanyou

Tao Liu <ltao@redhat.com> 于2024年11月21日周四 13:42写道:
Hi Guanyou,

Thanks for the fix. However I have a few questions for this patch:

On Fri, Nov 1, 2024 at 10:16 PM Guanyou Chen <chenguanyou9338@gmail.com> wrote:
>
> Hi  Lianbo, Tao
>
> When CPUs are in an offline state, it can lead to mapping errors.

1. What is the root cause, why cpu offline state will lead to mapping
errors? You didn't make this clear in your commit message.
2. Is this issue cpu arch specific? Since the example you gave is only
arm64 specific, however the code change is generic for all archs. So
I'm a bit worried about whether it will make the result incorrect for
other archs. In addition, I didn't reproduce the mapping error issue
on my x86_64 machine, so I cannot verify the patch locally.

> We need to map them to the correct positions one by one.
>

3. Please make the commit log tidy. I know you want to diff the
outputs of "before" and "after", but we only need the info which are
highly related to this issue. e.g:

> Before:

What command line you trigger will get the following outputs?

> n_namesz: 5 ("CPU2")
> n_descsz: 392
>   n_type: 1 (NT_PRSTATUS)
>           si.signo: 0  si.code: 0  si.errno: 0
>           cursig: 0  sigpend: 0  sighold: 0
>           pid: 3  ppid: 0  pgrp: 0  sid:0
>           utime: 0.000000  stime: 0.000000
>           cutime: 0.000000  cstime: 0.000000

I'm not sure if the above info is related to this issue?

>            X0: ffffffc000fc8818   X1: 0000000000000000   X2: ffffffc000fc84c8
>            X3: 0000000000000000   X4: ffffffc0405e37bf   X5: ffffffc00a07372f
>            X6: 322e34323320205b   X7: 545b5d3539383334   X8: ffffffc000fc2f0c
>            X9: 89fece0a9ef8cb00  X10: c0000001001f75f4  X11: 00000001001f75f4
>           X12: 0000000000000003  X13: 00000000000005f4  X14: ffffffc009eb1210
>           X15: 0000000000000004  X16: 000000002a4cec24  X17: 000000002a4cec24
>           X18: ffffffc009e7d140  X19: ffffffc00a04c670  X20: 0000000000000000
>           X21: 0000000000000000  X22: ffffff8027f22280  X23: 0000000000000009
>           X24: 0000000000000007  X25: ffffffc009f839c0  X26: ffffffc0090f87f8
>           X27: 0000000000000000  X28: ffffff80454f3840  X29: ffffffc0405e3b60
>            LR: ffffffc0080e57fc   SP: ffffffc0405e3b60   PC: ffffffc000fc2f84

We don't need a list of all regs, a few lines of which can indicate
the output mismatch for "Before" and "After" is enough.

>
> CPU 0: [OFFLINE]
> CPU 1: [OFFLINE]
> CPU 2:
>     X0: 0000000000000000   X1: 0000000000000000   X2: 0000000000000000
>     X3: 000000000003fcbc   X4: 0000000000000001   X5: 0000000000000000
>     X6: 0000000000000000   X7: 0000000000000000   X8: 00000000ffffffff
>     X9: ffffffc009e6ae48  X10: ffffffc009e6ae20  X11: 0000000000000000
>    X12: 0000000000000002  X13: 0000000000000004  X14: 0000000000000000
>    X15: 0000000000004000  X16: 00000000f90f05f6  X17: 00000000f90f05f6
>    X18: 0000000000000000  X19: 0000000000000002  X20: ffffffc009e3b008
>    X21: ffffffc00a01d020  X22: ffffffc009f798f0  X23: 0000000060001000
>    X24: 0000000000000000  X25: 0000000000000000  X26: 0000000000000000
>    X27: 0000000000000000  X28: ffffff8111eecb00  X29: ffffffc008003f50
>     LR: ffffffc00802df88   SP: ffffffc008003f40   PC: ffffffc00802df94
>    PSTATE: 024003c5   FPVALID: 00000000
>
> After:
> CPU 2:
>     X0: ffffffc000fc8818   X1: 0000000000000000   X2: ffffffc000fc84c8
>     X3: 0000000000000000   X4: ffffffc0405e37bf   X5: ffffffc00a07372f
>     X6: 322e34323320205b   X7: 545b5d3539383334   X8: ffffffc000fc2f0c
>     X9: 89fece0a9ef8cb00  X10: c0000001001f75f4  X11: 00000001001f75f4
>    X12: 0000000000000003  X13: 00000000000005f4  X14: ffffffc009eb1210
>    X15: 0000000000000004  X16: 000000002a4cec24  X17: 000000002a4cec24
>    X18: ffffffc009e7d140  X19: ffffffc00a04c670  X20: 0000000000000000
>    X21: 0000000000000000  X22: ffffff8027f22280  X23: 0000000000000009
>    X24: 0000000000000007  X25: ffffffc009f839c0  X26: ffffffc0090f87f8
>    X27: 0000000000000000  X28: ffffff80454f3840  X29: ffffffc0405e3b60

>     LR: ffffffc0080e57fc   SP: ffffffc0405e3b60   PC: ffffffc000fc2f84
>    PSTATE: 600000c5   FPVALID: 00000000
>
> crash> bt
> PID: 15959    TASK: ffffff80454f3840  CPU: 2    COMMAND: "AnrConsumer"
>     [ffffffc0405e3b60] ipanic at ffffffc000fc2f80 [mrdump]
>     [ffffffc0405e3b70] atomic_notifier_call_chain at ffffffc0080e57f8
>     [ffffffc0405e3c30] panic at ffffffc008f734d0
>     [ffffffc0405e3c80] sysrq_handle_crash at ffffffc0087f3c18
>     [ffffffc0405e3c90] __handle_sysrq at ffffffc0087f3798
>     [ffffffc0405e3ce0] write_sysrq_trigger at ffffffc0087f49c0
>     [ffffffc0405e3d00] proc_reg_write at ffffffc00842e4b8
>     [ffffffc0405e3d80] vfs_write at ffffffc008381eb4
>     [ffffffc0405e3dd0] ksys_write at ffffffc008382200
>     [ffffffc0405e3e10] __arm64_sys_write at ffffffc00838228c
>     [ffffffc0405e3e20] invoke_syscall at ffffffc00802efe0
>     [ffffffc0405e3e40] el0_svc_common at ffffffc00802eef4
>     [ffffffc0405e3e70] do_el0_svc at ffffffc00802ede8
>     [ffffffc0405e3e80] el0_svc at ffffffc008f7a7d0
>     [ffffffc0405e3ea0] el0t_64_sync_handler at ffffffc008f7a758
>     [ffffffc0405e3fe0] el0t_64_sync at ffffffc00801157c

The stacktrace is meaningless for this patch, we can cut this off.

>      PC: 00000077c798ca28   LR: 00000077a82e19f4   SP: 000000761c517af0
>     X29: 000000761c517b00  X28: 000000761c517db8  X27: 000000761c517c90
>     X26: 000000761c517c98  X25: 000000761c517bf9  X24: 000000761c519000
>     X23: 000000761c517be1  X22: 0000000000000001  X21: 00000000000003e3
>     X20: 000000761c517c11  X19: 000000761c517bf8  X18: 0000007568224000
>     X17: 00000077c798ca20  X16: 00000077c79b2ae0  X15: b4000077202cc480
>     X14: 0000000000000000  X13: 000000761c517a70  X12: ffffff80ffffffd0
>     X11: 000000761c517a40  X10: 0000000000000001   X9: 0000000000000000
>      X8: 0000000000000040   X7: 7f7f7f7f7f7f7f7f   X6: 0000000000000010
>      X5: 000000761c517c0c   X4: ffffffffffffffff   X3: ffffffffffffffff
>      X2: 0000000000000001   X1: 000000761c517c11   X0: 00000000000003e3
>     ORIG_X0: 00000000000003e3  SYSCALLNO: 40  PSTATE: 00001000
>
ditto.

Maybe a short summary after the output paste, telling us which info
should reviewers pay attention to.

Thanks,
Tao Liu

> Signed-off-by: Guanyou.Chen <chenguanyou@xiaomi.com>
> ---
>  netdump.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/netdump.c b/netdump.c
> index b4e2a5c..8ea5159 100644
> --- a/netdump.c
> +++ b/netdump.c
> @@ -75,7 +75,7 @@ void
>  map_cpus_to_prstatus(void)
>  {
>     void **nt_ptr;
> -   int online, i, j, nrcpus;
> +   int online, i, nrcpus;
>     size_t size;
>
>     if (pc->flags2 & QEMU_MEM_DUMP_ELF)  /* notes exist for all cpus */
> @@ -100,9 +100,9 @@ map_cpus_to_prstatus(void)
>      */
>     nrcpus = (kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS);
>
> -   for (i = 0, j = 0; i < nrcpus; i++) {
> +   for (i = 0; i < nrcpus; i++) {
>         if (in_cpu_map(ONLINE_MAP, i) && machdep->is_cpu_prstatus_valid(i)) {
> -           nd->nt_prstatus_percpu[i] = nt_ptr[j++];
> +           nd->nt_prstatus_percpu[i] = nt_ptr[i];
>             nd->num_prstatus_notes =
>                 MAX(nd->num_prstatus_notes, i+1);
>         }
> --
> 2.34.1
>
> Guanyou.
> Thanks.