Hi Aditya,
On Mon, Feb 12, 2024 at 4:11 PM Aditya Gupta <adityag(a)linux.ibm.com> wrote:
Hi Tao Liu,
On Fri, Feb 09, 2024 at 09:15:26PM +0800, Tao Liu wrote:
> On Fri, Feb 02, 2024 at 07:04:14AM +0000, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > Hi Aditya,
> >
> > Thank you for the work, it looks nicely done to me, except for the "info
> > threads" issue and the style of the gdb patch.
> >
> > Hi Tao Liu,
> >
> > I saw that you were interested in support for other architectures. I'd
> > like to test/support this also on x86_64 at the same time as ppc64 if
> > possible. Do you have any trial patch or plan?
>
> Hi Kazu & Aditya,
>
> Sorry for the delay. You can access my trial patch by
https://github.com/liutgnu/crash-dev. There are some known issue:
i tested it on a x86_64 and ppc64 vmcore i had, with these sequence of
commands (with some more 'gdb up' 'gdb down'):
(crash) set
(crash) set gdb on
gdb> thread
gdb> bt
gdb> info threads
gdb> info threads
gdb> info locals
gdb> info variables irq_rover_lock
gdb> info args
gdb> thread 2
gdb> set gdb off
(crash) set
(crash) set -c 6
(crash) gdb thread
(crash) bt
(crash) gdb bt
(crash) frame
(crash) gdb up
(crash) gdb down
(crash) info locals
had to rebase the patch onto my v8, the tree is available here:
https://github.com/adi-g15-ibm/crash/tree/tmp-test-branch-20201
>
> 1) Some vmcores will stack unwind fail:
> $ ./crash /var/crash/127.0.0.1-2023-11-10-18\:27\:30/vmcore ~/vmlinux
> KERNEL: /root/vmlinux [TAINTED]
> DUMPFILE: /var/crash/127.0.0.1-2023-11-10-18:27:30/vmcore [PARTIAL DUMP]
> CPUS: 1
> DATE: Fri Nov 10 18:27:26 CST 2023
> UPTIME: 00:10:49
> LOAD AVERAGE: 0.07, 0.11, 0.08
> TASKS: 133
> NODENAME:
ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com
> RELEASE: 5.14.0-39.el9.x86_64
> VERSION: #1 SMP PREEMPT Fri Dec 24 00:07:58 EST 2021
> MACHINE: x86_64 (2303 Mhz)
> MEMORY: 4 GB
> PANIC: "Oops: 0002 [#1] PREEMPT SMP NOPTI" (check log for details)
> PID: 22722
> COMMAND: "insmod"
> TASK: ffff973e88408000 [THREAD_INFO: ffff973e88408000]
> CPU: 0
> STATE: TASK_RUNNING (PANIC)
>
> crash> set 1
> >>>> sp:ffffa5ee40013d00 bp:ffff973efbc2c800 ip:ffffffffb726a1b3
> >>>> sp:ffffa5ee40013d00 bp:ffff973efbc2c800 ip:ffffffffb726a1b3
> crash> gdb bt
> #0 0xffffffffb726a1b3 in context_switch (rf=0xffffa5ee40013d00,
next=0xffff973e88408000, prev=0xffff973e801fb280, rq=0xffff973efbc2c800) at
kernel/sched/core.c:4972
> #1 __schedule (sched_mode=0) at kernel/sched/core.c:6253
> #2 0x0004005300000004 in ?? ()
> crash> bt
> PID: 1 TASK: ffff973e801fb280 CPU: 0 COMMAND: "systemd"
> #0 [ffffa5ee40013d30] __schedule at ffffffffb726a1b3
> #1 [ffffa5ee40013d78] schedule at ffffffffb726a553
> #2 [ffffa5ee40013d90] schedule_hrtimeout_range_clock at ffffffffb726f294
> #3 [ffffa5ee40013e10] ep_poll at ffffffffb6bc00c4
> #4 [ffffa5ee40013eb0] do_epoll_wait at ffffffffb6bc01db
> #5 [ffffa5ee40013ee8] __x64_sys_epoll_wait at ffffffffb6bc09b0
> #6 [ffffa5ee40013f38] do_syscall_64 at ffffffffb725ea98
> #7 [ffffa5ee40013f50] entry_SYSCALL_64_after_hwframe at ffffffffb740007c
>
> The stack unwinding failed for "gdb bt", it only unwinded the
"__schedule" function. The similarities for the failing is rsp and rbp which got
from stack pointing to different stack frames:
> >>>> sp:ffffa5ee40013d00 bp:ffff973efbc2c800 ip:ffffffffb726a1b3
> crash> rd ffffa5ee40013d00 32
> ffffa5ee40013d00: ffff973ef8eef4a8 0000000000000000 ....>........... r15, r14
(struct inactive_task_frame)
> ffffa5ee40013d10: ffff973e88408000 ffff973e801fb280 ..@.>.......>... r13,
r12
> ffffa5ee40013d20: ffff973e801fbd98 ffff973efbc2c800 ....>.......>... bx,
bp
> ffffa5ee40013d30: ffffffffb726a1b3 ffff973e00000001 ..&.........>...
ret_addr
> ffffa5ee40013d40: 0004005300000004 f915b72f0f356b00 ....S....k5./...
> ffffa5ee40013d50: ffff973e801fb280 ffff973e801fb280 ....>.......>...
> ffffa5ee40013d60: ffff973e801fb280 ffff973ef8eef4e0 ....>.......>...
> ffffa5ee40013d70: 0000000000000000 ffffffffb726a553 ........S.&.....
> ffffa5ee40013d80: ffff973ef8eef480 ffffa5ee40013e68 ....>...h>.@....
> ffffa5ee40013d90: ffffffffb726f294 ffffffffb6bbf092 ..&.............
> ffffa5ee40013da0: 000055d6d4081de0 00000054b6b4c6fd .....U......T...
> ffffa5ee40013db0: ffff973ef8eef4d0 ffffa5ee40013db8 ....>....=.@....
> ffffa5ee40013dc0: ffffa5ee40013db8 0000000000000000 .=.@............
> ffffa5ee40013dd0: ffff973e00000019 f915b72f0f356b00 ....>....k5./...
> ffffa5ee40013de0: f915b72f0f356b00 ffff973ef8eef480 .k5./.......>...
> ffffa5ee40013df0: ffffa5ee40013e68 ffff973e801fb280 h>.@........>...
>
> And bp is modified as follows:
>
> crash> dis schedule
> 0xffffffffb726a510 <schedule>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
> 0xffffffffb726a515 <schedule+5>: push %rbp
> 0xffffffffb726a516 <schedule+6>: mov %gs:0x16f40,%rbp <<<
> 0xffffffffb726a51f <schedule+15>: push %rbx
> 0xffffffffb726a520 <schedule+16>: mov 0x18(%rbp),%eax
> 0xffffffffb726a523 <schedule+19>: test %eax,%eax
>
> I'm not sure why in this case gdb cannot get the stack unwinded. Other than this
case, the x86_64 stack unwinding works fine according to my test.
>
> 2) May break the original ppc64 patch.
>
> This x86_64 patch is based on the original v7 ppc stack unwinding patch. And it
modified a bit of the original ppc64 patch code. I haven't fully tested the code in
ppc64 arch.
For now it seems to work fine on ppc64, I will go through the code
changes once. By the way, can you post a rebased patch series for x86 ?
I noticed there is some problem with the ppc64 change thread context
support. Currently for x86_64, we use "set <pid>" cmd to switch the
current task context to be the task <pid>'s, then we use cmd "gdb bt"
to view the stack unwinding of task <pid>, by which we can view
arbitrary task's stack unwinding.
However for ppc64 code, there is no "get arbitrary tasks register"
code. E.g. in ppc64_get_cpu_reg():
task = get_active_task(cpu);
tc = task_to_context(task);
So no matter how we set any pid, the registers are always the active
tasks' register, instead of <pid> tasks' register.
It would be simple to get this fixed, however the more challenging
part, which is difficult for me, is how to get the inactive tasks
registers, So we can do similar things as in
x86_64.c:x86_64_get_stack_frame() of
https://github.com/liutgnu/crash-dev?
I see there is a structure as task_struct -> thread_struct -> struct
pt_regs *regs. But sometimes the regs will give NULL value.
Thanks,
Tao Liu
Thanks,
Aditya Gupta
>
> Thanks,
> Tao Liu
>
>
> >
> > Thanks,
> > Kazu
> >
> > On 2024/01/05 16:30, Aditya Gupta wrote:
> > > The Problem:
> > > ============
> > >
> > > Currently crash is unable to show function arguments and local variables,
as
> > > gdb can do. And functionality for moving between frames
('up'/'down') is not
> > > working in crash.
> > >
> > > Crash has 'gdb passthroughs' for things gdb can do, but the gdb
passthroughs
> > > 'bt', 'frame', 'info locals', 'up',
'down' are not working either, due to
> > > gdb not getting the register values from `crash_target::fetch_registers`,
> > > which then uses `machdep->get_cpu_reg`, which is not implemented for
PPC64
> > >
> > > Proposed Solution:
> > > ==================
> > >
> > > Fix the gdb passthroughs by implementing
"machdep->get_cpu_reg" for PPC64.
> > > This way, "gdb mode in crash" will support this feature for both
ELF and
> > > kdump-compressed vmcore formats, while "gdb" would only have
supported ELF
> > > format
> > >
> > > This way other features of 'gdb', such as seeing
> > > backtraces/registers/variables/arguments/local variables, moving up and
> > > down stack frames, can be used with any ppc64 vmcore, irrespective of
> > > being ELF format or kdump-compressed format.
> > >
> > > Note: This doesn't support live debugging on ppc64, since registers
are not
> > > available to be read
> > >
> > > Implications on Architectures:
> > > ====================================
> > >
> > > No architecture other than PPC64 has been affected, other than in case of
> > > 'frame' command
> > >
> > > As mentioned in patch #2, since frame will not be prohibited, so it will
print:
> > >
> > > crash> frame
> > > #0 <unavailable> in ?? ()
> > >
> > > Instead of before prohibited message:
> > >
> > > crash> frame
> > > crash: prohibited gdb command: frame
> > >
> > > Major change will be in 'gdb mode' on PPC64, that it will print
the frames, and
> > > local variables, instead of failing with errors showing no frame, or
showing
> > > that couldn't get PC, it will be able to give all this information.
> > >
> > > Testing:
> > > ========
> > >
> > > Git tree with this patch series applied:
> > >
https://github.com/adi-g15-ibm/crash/tree/stack-unwind-v6
> > >
> > > To test various gdb passthroughs:
> > >
> > > (crash) set
> > > (crash) set gdb on
> > > gdb> thread
> > > gdb> bt
> > > gdb> info threads
> > > gdb> info threads
> > > gdb> info locals
> > > gdb> info variables irq_rover_lock
> > > gdb> info args
> > > gdb> thread 2
> > > gdb> set gdb off
> > > (crash) set
> > > (crash) set -c 6
> > > (crash) gdb thread
> > > (crash) bt
> > > (crash) gdb bt
> > > (crash) frame
> > > (crash) up
> > > (crash) down
> > > (crash) info locals
> > >
> > > Known Issues:
> > > =============
> > >
> > > 1. In gdb mode, 'bt' might fail to show backtrace in few vmcores
collected
> > > from older kernels. This is a known issue due to register mismatch,
and
> > > its fix has been merged upstream:
> > >
> > > This can also cause some 'invalid kernel virtual address'
errors during gdb
> > > unwinding the stack registers
> > >
> > > Commit:
https://github.com/torvalds/linux/commit/b684c09f09e7a6af3794d4233ef78581...
> > >
> > > Fixing GDB passthroughs on other architectures
> > > ==============================================
> > >
> > > Much of the work for making gdb passthroughs like 'gdb bt',
'gdb
> > > thread', 'gdb info locals' etc. has been done by the patches
introducing
> > > 'machdep->get_cpu_reg' and this series fixing some issues in
that.
> > >
> > > Other architectures should be able to fix these gdb functionalities by
> > > simply implementing 'machdep->get_cpu_reg (cpu, regno, ...)'.
> > >
> > > The reasoning behind that has been explained with a diagram in commit
> > > description of patch #1
> > >
> > > I will assist with my findings/observations fixing it on ppc64 whenever
needed.
> > >
> > > Changelog:
> > > ==========
> > >
> > > V6:
> > > + changes in patch #5: fix bug introduced in v5 that caused initial gdb
thread
> > > to be thread 1
> > >
> > > V5:
> > > + changes in patch #1: made ppc64_get_cpu_reg static, and remove
unreachable
> > > code
> > > + changes in patch #3: fixed typo 'ppc64_renum' instead of
'ppc64_regnum',
> > > remove unneeded if condition
> > > + changes in patch #5: implement refresh regcache on per thread, instead
of all
> > > threads at once
> > >
> > > V4:
> > > + fix segmentation fault in live debugging (change in patch #1)
> > > + mention live debugging not supported in cover letter and patch #1
> > > + fixed some checkpatch warnings (change in patch #5)
> > >
> > > V3:
> > > + default gdb thread will be the crashing thread, instead of being
> > > thread '0'
> > > + synchronise crash cpu and gdb thread context
> > > + fix bug in gdb_interface, that replaced gdb's output stream, losing
> > > output in some cases, such as info threads and extra output in info
> > > variables
> > > + fix 'info threads'
> > >
> > > RFC V2:
> > > - removed patch implementing 'frame', 'up',
'down' in crash
> > > - updated the cover letter by removing the mention of those commands
other
> > > than the respective gdb passthrough
> > >
> > > Aditya Gupta (5):
> > > ppc64: correct gdb passthroughs by implementing
machdep->get_cpu_reg
> > > remove 'frame' from prohibited commands list
> > > synchronise cpu context changes between crash/gdb
> > > fix gdb_interface: restore gdb's output streams at end of
> > > gdb_interface
> > > fix 'info threads' command
> > >
> > > crash_target.c | 44 ++++++++++++++++
> > > defs.h | 130 +++++++++++++++++++++++++++++++++++++++++++++++-
> > > gdb-10.2.patch | 110 +++++++++++++++++++++++++++++++++++++++-
> > > gdb_interface.c | 2 +-
> > > kernel.c | 47 +++++++++++++++--
> > > ppc64.c | 95 +++++++++++++++++++++++++++++++++--
> > > task.c | 14 ++++++
> > > tools.c | 2 +-
> > > 8 files changed, 434 insertions(+), 10 deletions(-)
> > >
>