Hi Lijiang,
Thanks for reviewing the patchset.
On Fri, Aug 30, 2024 at 11:36 PM lijiang <lijiang(a)redhat.com> wrote:
On Mon, Aug 26, 2024 at 11:55 AM <devel-request(a)lists.crash-utility.osci.io>
wrote:
>
> Date: Mon, 26 Aug 2024 15:52:26 +1200
> From: Tao Liu <ltao(a)redhat.com>
> Subject: [Crash-utility] [PATCH v6 00/14] gdb stack unwinding support
> for crash utility
> To: devel(a)lists.crash-utility.osci.io
> Cc: Tao Liu <ltao(a)redhat.com>
> Message-ID: <20240826035240.14781-1-ltao(a)redhat.com>
> Content-Type: text/plain; charset=UTF-8
>
> This patchset is a rebase/merged version of the following 3 patchsets:
>
> 1): [PATCH v10 0/5] Improve stack unwind on ppc64 [1]
> 2): [PATCH 0/5] x86_64 gdb stack unwinding support [2]
> 3): Clean up on top of one-thread-v2 [3]
>
> A complete description of gdb stack unwinding support for crash can be
> found in [1].
>
> This patchset can be divided into the following 2 parts:
>
> 1) part1: arch independent, mainly modify on the
> crash_target.c/gdb_interface.c files, in preparation of the
> gdb side.
> 2) part2: arch specific part, for implementing ppc64/x86_64/arm64/vmware
> gdb stack unwinding support.
>
> === part 2
>
> - arm64:
> arm64: Add gdb stack unwinding support
>
> - vmware:
> vmware_guestdump: Various format versions support
> set_context(): check if context is already current
>
> - x86_64:
> x86_64: Fix invalid input "=>" for bt command
> Fix cpumask_t recursive dependence issue
> x86_64: Add gdb stack unwinding support
>
> - ppc64:
> ppc64: correct gdb passthroughs by implementing machdep->get_cpu_reg
>
> === part 1
>
> Stop stack unwinding at non-kernel address
> Fix gdb_interface: restore gdb's output streams at end of gdb_interface
> Print task pid/command instead of CPU index
> Rename get_cpu_reg to get_current_task_reg
> Let crash change gdb context
> Leave only one gdb thread for crash
> Remove 'frame' from prohibited commands list
> ===
>
> v6 -> v5:
> 1) Refactor patch 4 & 9, which changed the function signature of struct
> get_cpu_reg/get_current_task_reg, and let each patch compile with no
> error when added on.
> 2) Rebased the patchset on top of latest upstream:
> ("79b93ecb2e72ec Fix a "Bus error" issue caused by 'crash
--osrelease' or
> crash loading")
>
Thank you for the update, Tao.
I have a few more comments here:
[1] [PATCH v6 04/14]:
+ * Task is active, grab CPU's registers
+ */
+ if (is_task_active(tc->task) && VMSS_DUMPFILE())
+ return vmware_vmss_get_cpu_reg(tc->processor, regno, name, size,
value);
Can you help confirm that it only works for the active task? Is this expected behavior?
Yes, by reading through the code context, it looks to me the vmss
contains register status for each CPU, which are valid for active
tasks. For those inactive tasks, there is no code to get the registers
from stack frame like the one we deal for normal kernel. So I think it
is safe to code like this.
[2] [PATCH v6 07/14]
#ifdef CRASH_MERGE
CORE_ADDR pc = 0;
get_frame_pc_if_available (fi, &pc);
if (!is_kvaddr(pc)) {
printf_filtered (_("Backtrace stopped due to non-kernel addr:
%lx\n"),pc);
fi = NULL;
break;
}
#endif
I would suggest removing the above printf_filtered(...), otherwise it will be always
displayed.
Hmm... I would suggest to keep the printf_filtered() function. I
encountered cases which nothing get outputted after "crash> gdb bt"
without the printf_filtered(), it is quite misleading because user
have no idea what's going on. With the printf_filtered(), there would
be hints indicating the failure for gdb stack unwinding. To
illustrate:
crash> bt
PID: 138 TASK: ffff8804280bb8e0 CPU: 3 COMMAND: "rcuos/0"
#0 [ffff8804280c5910] machine_kexec at ffffffff81041181
#1 [ffff8804280c5968] crash_kexec at ffffffff810cf0e2
#2 [ffff8804280c5a38] oops_end at ffffffff815ea548
#3 [ffff8804280c5a60] no_context at ffffffff815daf63
#4 [ffff8804280c5ab0] __bad_area_nosemaphore at ffffffff815daff9
#5 [ffff8804280c5af8] bad_area_nosemaphore at ffffffff815db163
#6 [ffff8804280c5b08] __do_page_fault at ffffffff815ed36e
#7 [ffff8804280c5c08] do_page_fault at ffffffff815ed58a
#8 [ffff8804280c5c30] page_fault at ffffffff815e97c8
[exception RIP: unknown or invalid address]
RIP: 0000000000000000 RSP: ffff8804280c5ce0 RFLAGS: 00010046
RAX: ffffffff8160ea00 RBX: ffff88042e614580 RCX: ffff88042e600000
RDX: 0000000000000005 RSI: ffffffff818d5440 RDI: ffff88042e614580
RBP: ffff8804280c5d00 R8: 0000000000000000 R9: 0000000180400027
R10: ffffffff811b153f R11: ffffea0000dc8b00 R12: ffff88042e614580
R13: ffffffff818d5c44 R14: 0000000000000046 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff8804280c5ce0] enqueue_task at ffffffff810934dc
#10 [ffff8804280c5d08] activate_task at ffffffff81094c33
#11 [ffff8804280c5d18] ttwu_do_activate.constprop.85 at ffffffff81094f63
...
crash> gdb bt
Backtrace stopped due to non-kernel addr: 0
There are cases whose RIP value is invalid, e.g. NULL or non-kernel
address. Currently the patchset cannot unwind the stack frames based
on the faulty RIP value, and will fail silently without the
printf_filtered(). A "Backtrace stopped due to non-kernel addr: 0"
would be a good hint for user to know the fail reason with the info
combinated with cmd "bt".
[3] [PATCH v6 08/14]
a. warning
cc -c -g -DPPC64 -m64 -DLZO -DGDB_10_2 ppc64.c -Wall -O2 -Wstrict-prototypes
-Wmissing-prototypes -fstack-protector -Wformat-security
ppc64.c: In function ‘ppc64_get_current_task_reg’:
ppc64.c:2519:15: warning: unused variable ‘task’ [-Wunused-variable]
2519 | ulong task;
| ^~~~
Will eliminate it.
b. There are different results between 'bt' and 'gdb bt' commands. Can
you help double check?
crash> bt
PID: 6298 TASK: c000000050bbcb00 CPU: 2 COMMAND: "bash"
R0: c0000000002be63c R1: c00000005102b710 R2: c000000001bf8f00
R3: c00000005102b728 R4: 0000000000000000 R5: 0000000000000000
R6: c00000005102b8d8 R7: c00000000f5d0000 R8: 0000000000000000
R9: c0000000557d3c00 R10: 0000000000000001 R11: 0000000000002000
R12: c0000000027634a8 R13: c00000000f6cdf00 R14: 0000000000000000
R15: 0000000000000000 R16: 0000000000000000 R17: 0000000000000000
R18: 0000000000000000 R19: 0000000000000000 R20: 0000000000000000
R21: 0000000000000000 R22: 0000000000000000 R23: 0000000000000000
R24: 0000000000000007 R25: 0000000000000000 R26: 0000000000000000
R27: c00000000274d898 R28: c000000002d0a8d0 R29: c000000002e23df0
R30: 0000000000000000 R31: c000000002e23ddc
NIP: c0000000002bdf64 MSR: 8000000000009033 OR3: 0000000000000000
CTR: 0000000000000000 LR: c0000000002be63c XER: 0000000020040004
CCR: 0000000024222280 MQ: 0000000000000001 DAR: 0000000000000000
DSISR: 0000000000000000 Syscall Result: 0000000000000000
[NIP : crash_setup_regs+68]
[LR : __crash_kexec+156]
#0 [c00000005102b710] crash_setup_regs at c0000000002bdf64
crash> gdb bt
#0 0xc0000000002bdf64 in crash_setup_regs (newregs=0xc00000005102b728, oldregs=0x0) at
./arch/powerpc/include/asm/kexec.h:133
#1 0xc0000000002be658 in __crash_kexec (regs=0x0) at kernel/crash_core.c:122
#2 0xc00000000016c284 in panic (fmt=0xc0000000015dd018 "sysrq triggered
crash\n") at kernel/panic.c:367
#3 0xc000000000a66e78 in sysrq_handle_crash (key=<optimized out>) at
drivers/tty/sysrq.c:154
#4 0xc000000000a67994 in __handle_sysrq (key=key@entry=99 'c',
check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:612
#5 0xc000000000a68454 in write_sysrq_trigger (file=<optimized out>,
buf=<optimized out>, count=2, ppos=<optimized out>) at
drivers/tty/sysrq.c:1181
#6 0xc00000000072868c in pde_write (pde=0xc000000003671f80, file=<optimized out>,
buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at
fs/proc/inode.c:334
#7 proc_reg_write (file=<optimized out>, buf=<optimized out>,
count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:346
#8 0xc00000000063bf20 in vfs_write (file=0xc000000009cb5b00, buf=0x100123a71a0
<error: Cannot access memory at address 0x100123a71a0>, count=2,
pos=0xc00000005102bc00) at fs/read_write.c:588
#9 vfs_write (file=0xc000000009cb5b00, buf=0x100123a71a0 <error: Cannot access memory
at address 0x100123a71a0>, count=<optimized out>, pos=0xc00000005102bc00) at
fs/read_write.c:570
#10 0xc00000000063c4d0 in ksys_write (fd=<optimized out>, buf=0x100123a71a0
<error: Cannot access memory at address 0x100123a71a0>, count=2) at
fs/read_write.c:643
#11 0xc000000000031a28 in system_call_exception (regs=0xc00000005102be80,
r0=<optimized out>) at arch/powerpc/kernel/syscall.c:153
#12 0xc00000000000d05c in system_call_vectored_common () at
arch/powerpc/kernel/interrupt_64.S:198
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
crash>
I didn't get your point of "different results", which part is different?
[4] [PATCH v6 10/14]
diff --git a/xen_hyper.c b/xen_hyper.c
index 32e56fa..54c7f57 100644
--- a/xen_hyper.c
+++ b/xen_hyper.c
@@ -52,7 +52,7 @@ xen_hyper_init(void)
*/
xht->xen_virt_start &= 0xffffffffc0000000;
#endif
-
+ STRUCT_SIZE_INIT(cpumask_t, "cpumask_t");
if (machine_type("X86_64") &&
symbol_exists("xen_phys_start") && !xen_phys_start())
error(WARNING,
Could you please explain why the above changes are needed?
Yes it is needed. For normal kernel, the cpumask_t is initialized in
kernel_init(), for xen_hyper it will not call kernel_init(), so it
should be initialized in xen_hyper_init() before use.
[5] [PATCH v6 14/14]
a. warning
cc -c -g -DARM64 -DLZO -DGDB_10_2 arm64.c -Wall -O2 -Wstrict-prototypes
-Wmissing-prototypes -fstack-protector -Wformat-security
arm64.c: In function ‘arm64_get_stack_frame’:
arm64.c:4138:13: warning: variable ‘ret’ set but not used [-Wunused-but-set-variable]
4138 | int ret;
| ^~~
Will remove it.
b. indentation issue
- bt->task);
+ ur_bitmap = (struct user_regs_bitmap_struct
*)GETBUF(sizeof(*ur_bitmap));
+ memset(ur_bitmap, 0, sizeof(*ur_bitmap));
+ ur_bitmap->ur.pc = stackframe.pc;
[6] The 'info threads' command only displays the current task, not all known
tasks. And this looks different behavior from the gdb
gdb> info threads
Id Target Id Frame
* 1 6298 bash 0xc0000000002bdf64 in crash_setup_regs
(newregs=0xc00000005102b728, oldregs=0x0) at ./arch/powerpc/include/asm/kexec.h:133
[7] the 'thread' command can not switch to another task.
crash> ps
...
6297 6287 0 c000000057517e80 IN 0.0 22784 11776 sshd-session
> 6298 6297 2 c000000050bbcb00 RU 0.0 9664 6272 bash
...
gdb> thread 6297
gdb: gdb request failed: thread 6297
gdb>
Finally I have to do it with 'set' command as below:
crash> set 6297
PID: 6297
COMMAND: "sshd-session"
TASK: c000000057517e80 [THREAD_INFO: c000000057517e80]
CPU: 0
STATE: TASK_INTERRUPTIBLE
crash> set gdb on
gdb: on
gdb> bt
#0 0xc00000009cd57650 in ?? ()
gdb: gdb request failed: bt
gdb>
Yes, the behaviour is expected to me. Previously crash will
initialized the number of threads same as the cpu number. In fact
there is no need to do that, only one thread for gdb is enough for
crash-use, see the discussion in [1]. The one-thread will act like a
container, when set <pid>, the thread will be filled with task <pid>
context. So there is no need for "thread <id>" to swith thread. Just
"set <pid>" is enough.
[1]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00550.html
In addition, the code is changed, but its patch log is not updated accordingly. Could you
please double check? I won't list them here.
OK will do
Thanks,
Tao Liu
Thanks.
Lianbo
> v5 -> v4:
> 1) Plenty of code refactoring based on Lianbo's comments on v4.
> 2) Removed the magic number when dealing with regs bitmap, see [6].
> 3) Rebased the patchset on top of latest upstream:
> ("1c6da3eaff8207 arm64: Fix bt command show wrong stacktrace on ramdump
source")
>
> v4 -> v3:
> Fixed the author issue in [PATCH v3 06/16] Fix gdb_interface: restore gdb's
> output streams at end of gdb_interface.
>
> v3 -> v2:
> 1) Updated CC list as pointed out in [4]
> 2) Compiling issues as in [5]
>
> v2 -> v1:
> 1) Added the patch: x86_64: Fix invalid input "=>" for bt command,
> thanks for Kazu's testing.
> 2) Modify the patch: x86_64: Add gdb stack unwinding support, added the
> pcp_save, spp_save and sp, for restoring the value in match of the original
> code logic.
>
> [1]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00469.html
> [2]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00488.html
> [3]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00554.html
> [4]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00681.html
> [5]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00715.html
> [6]:
https://www.mail-archive.com/devel@lists.crash-utility.osci.io/msg00819.html
>
> Aditya Gupta (3):
> Remove 'frame' from prohibited commands list
> Fix gdb_interface: restore gdb's output streams at end of
> gdb_interface
> ppc64: correct gdb passthroughs by implementing machdep->get_cpu_reg
>
> Alexey Makhalov (2):
> set_context(): check if context is already current
> vmware_guestdump: Various format versions support
>
> Tao Liu (9):
> Leave only one gdb thread for crash
> Let crash change gdb context
> Rename get_cpu_reg to get_current_task_reg
> Print task pid/command instead of CPU index
> Stop stack unwinding at non-kernel address
> x86_64: Add gdb stack unwinding support
> Fix cpumask_t recursive dependence issue
> x86_64: Fix invalid input "=>" for bt command
> arm64: Add gdb stack unwinding support
>
> arm64.c | 115 +++++++++++++++-
> crash_target.c | 71 ++++++----
> defs.h | 194 ++++++++++++++++++++++++++-
> gdb-10.2.patch | 82 ++++++++++++
> gdb_interface.c | 35 ++---
> kernel.c | 63 +++++++--
> ppc64.c | 175 +++++++++++++++++++++++-
> symbols.c | 15 +++
> task.c | 34 +++--
> tools.c | 10 +-
> unwind_x86_64.h | 4 -
> vmware_guestdump.c | 321 +++++++++++++++++++++++++++++++-------------
> x86_64.c | 323 ++++++++++++++++++++++++++++++++++++++++-----
> xen_hyper.c | 2 +-
> 14 files changed, 1225 insertions(+), 219 deletions(-)
>
> --
> 2.40.1