On Fri, May 31, 2024 at 5:30 PM <devel-request@lists.crash-utility.osci.io> wrote:
Date: Fri, 31 May 2024 17:19:30 +0800
From: Tao Liu <ltao@redhat.com>
Subject: [Crash-utility] [PATCH v4 07/16] Stop stack unwinding at
        non-kernel address
To: devel@lists.crash-utility.osci.io
Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>, "Naveen N . Rao"
        <naveen.n.rao@linux.vnet.ibm.com>, Lianbo Jiang <lijiang@redhat.com>,
        Alexey Makhalov <alexey.makhalov@broadcom.com>
Message-ID: <20240531091939.97828-8-ltao@redhat.com>
Content-Type: text/plain; charset=UTF-8

The stack unwinding is for kernel addresses only. If non-kernel address
encountered, it is usually a user space address, or non-address value
like a function call parameter. So stopping stack unwinding at non-kernel
address will decrease the invalid unwind results.

Before:
crash> gdb bt
 #0  0xffffffff816a8f65 in context_switch ...
 #1  __schedule () ...
 #2  0xffffffff816a94e9 in schedule ...
 #3  0xffffffff816a86fd in schedule_hrtimeout_range_clock ...
 #4  0xffffffff816a8733 in schedule_hrtimeout_range ...
 #5  0xffffffff8124bb7e in ep_poll ...
 #6  0xffffffff8124d00d in SYSC_epoll_wait ...
 #7  SyS_epoll_wait ...
 #8  <signal handler called>
 #9  0x00007f0449407923 in ?? ()
 #10 0xffff880100000001 in ?? ()
 #11 0xffff880169b3c010 in ?? ()
 #12 0x0000000000000040 in irq_stack_union ()
 #13 0xffff880169b3c058 in ?? ()
 #14 0xffff880169b3c048 in ?? ()
 #15 0xffff880169b3c050 in ?? ()
 #16 0x0000000000000000 in ?? ()

After:
crash> gdb bt
 #0  0xffffffff816a8f65 in context_switch ...
 #1  __schedule () ...
 #2  0xffffffff816a94e9 in schedule () ...
 #3  0xffffffff816a86fd in schedule_hrtimeout_range_clock ...
 #4  0xffffffff816a8733 in schedule_hrtimeout_range ...
 #5  0xffffffff8124bb7e in ep_poll ...
 #6  0xffffffff8124d00d in SYSC_epoll_wait ...
 #7  SyS_epoll_wait ...
 #8  <signal handler called>
 #9  0x00007f0449407923 in ?? ()


It seems that there are still some non-kernel addresses that do not get filtered. Can you help double check?

For example:

crash> gdb bt
#0  crash_setup_regs (newregs=0xffffb5bb4f197938, oldregs=0x0) at ./arch/x86/include/asm/kexec.h:114
#1  0xffffffff8e61e32e in __crash_kexec (regs=regs@entry=0x0) at kernel/crash_core.c:122
#2  0xffffffff8e51a64d in panic (fmt=fmt@entry=0xffffffff8fa51609 "sysrq triggered crash\n") at kernel/panic.c:366
#3  0xffffffff8ec21f86 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:154
#4  0xffffffff8ec22550 in __handle_sysrq (key=<optimized out>, check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:612
#5  0xffffffff8ec22bf5 in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1183
#6  0xffffffff8e935ae5 in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=0xffffb5bb4f197938, pde=0xffff98338b78e0c0) at fs/proc/inode.c:334
#7  proc_reg_write (file=0xffffb5bb4f197938, buf=0x0, count=1, ppos=0x0) at fs/proc/inode.c:346
#8  0xffffffff8e88d382 in vfs_write (file=file@entry=0xffff98338b789200, buf=buf@entry=0x5614d58a22c0 <error: Cannot access memory at address 0x5614d58a22c0>, count=count@entry=2, pos=pos@entry=0xffffb5bb4f197b78) at fs/read_write.c:588
#9  0xffffffff8e88d9ff in ksys_write (fd=<optimized out>, buf=0x5614d58a22c0 <error: Cannot access memory at address 0x5614d58a22c0>, count=2) at fs/read_write.c:643
#10 0xffffffff8f124429 in do_syscall_x64 (nr=1, regs=0xffffb5bb4f197f58) at arch/x86/entry/common.c:52
#11 do_syscall_64 (regs=0xffffb5bb4f197f58, nr=1) at arch/x86/entry/common.c:83
#12 0xffffffff8f20012b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121
#13 0x00007f9a147f69e0 in ?? ()

The frame #13 looks like a non-kernel address.


Thanks

Lianbo
 
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab@nec.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: Alexey Makhalov <alexey.makhalov@broadcom.com>
Signed-off-by: Tao Liu <ltao@redhat.com>
---
 defs.h          |  1 +
 gdb-10.2.patch  | 26 ++++++++++++++++++++++++++
 gdb_interface.c |  6 ++++++
 3 files changed, 33 insertions(+)

diff --git a/defs.h b/defs.h
index 012ffdc..c0e6a29 100644
--- a/defs.h
+++ b/defs.h
@@ -7902,6 +7902,7 @@ extern unsigned char *gdb_prettyprint_arrays;
 extern unsigned int *gdb_repeat_count_threshold;
 extern unsigned char *gdb_stop_print_at_null;
 extern unsigned int *gdb_output_radix;
+int is_kvaddr(ulong);

 /*
  *  gdb/top.c
diff --git a/gdb-10.2.patch b/gdb-10.2.patch
index 0bed96a..3ed40c0 100644
--- a/gdb-10.2.patch
+++ b/gdb-10.2.patch
@@ -16171,3 +16171,29 @@ exit 0
  }

  /*
+--- gdb-10.2/gdb/frame.c.orig
++++ gdb-10.2/gdb/frame.c
+@@ -2331,6 +2331,10 @@ inside_entry_func (frame_info *this_frame)
+    This function should not contain target-dependent tests, such as
+    checking whether the program-counter is zero.  */
+
++#ifdef CRASH_MERGE
++extern "C" int is_kvaddr(ulong);
++#endif
++
+ struct frame_info *
+ get_prev_frame (struct frame_info *this_frame)
+ {
+@@ -2353,7 +2357,11 @@ get_prev_frame (struct frame_info *this_frame)
+     get_frame_id (this_frame);
+
+   frame_pc_p = get_frame_pc_if_available (this_frame, &frame_pc);
+-
++#ifdef CRASH_MERGE
++  if (!is_kvaddr(frame_pc)) {
++        return NULL;
++  }
++#endif
+   /* tausq/2004-12-07: Dummy frames are skipped because it doesn't make much
+      sense to stop unwinding at a dummy frame.  One place where a dummy
+      frame may have an address "inside_main_func" is on HPUX.  On HPUX, the
diff --git a/gdb_interface.c b/gdb_interface.c
index b13d5fd..e76ecc6 100644
--- a/gdb_interface.c
+++ b/gdb_interface.c
@@ -947,6 +947,12 @@ gdb_lookup_module_symbol(ulong addr, ulong *offset)
        }
 }

+int
+is_kvaddr(ulong addr)
+{
+       return IS_KVADDR(addr);
+}
+
 /*
  *  Used by gdb_interface() to catch gdb-related errors, if desired.
  */
--
2.40.1