[Crash-utility] Re: [PATCH] Fix a "Bus error" issue caused by 'crash --osrelease' or crash loading

Thursday, 4 July 2024

On Fri, Jul 5, 2024 at 7:29 AM Tao Liu <ltao(a)redhat.com&gt; wrote:

...
 Hi Lianbo,

 On Mon, Jul 1, 2024 at 9:15 PM lijiang <lijiang(a)redhat.com&gt; wrote:
 >
 > Hi, Tao
 > On Thu, Jun 27, 2024 at 12:08 PM Tao Liu <ltao(a)redhat.com&gt; wrote:
 >>
 >> Hi Lianbo,
 >>
 >> Thanks for the patch.
 >>
 >> On Fri, Jun 14, 2024 at 3:00 PM Lianbo Jiang <lijiang(a)redhat.com&gt;
 wrote:
 >> >
 >> > Sometimes, in a production environment, there are still some vmcores
 >> > that are incomplete, such as partial header or the data is corrupted.
 >> > When crash tool attempts to parse such vmcores, it may fail as below:
 >> >
 >> >   $ ./crash --osrelease vmcore
 >> >   Bus error (core dumped)
 >> >
 >> > or
 >> >
 >> >   $ crash vmlinux vmcore
 >> >   ...
 >> >   Bus error (core dumped)
 >> >  $
 >> >
 >> > The gdb sees that crash tool reads out a null bitmap from the header
 in
 >> > this vmcore, when executing memcpy(), emits a SIGBUS error as below:
 >> >
 >> >   $ gdb /home/lijiang/src/crash/crash /tmp/core.126301
 >> >   Core was generated by `./crash --osrelease
 /home/lijiang/src/39317/vmcore'.
 >> >   Program terminated with signal SIGBUS, Bus error.
 >> >   #0  __memcpy_evex_unaligned_erms () at
 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
 >> >   831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5),
 %VMM(6), %VMM(7))
 >> >   (gdb) bt
 >> >   #0  __memcpy_evex_unaligned_erms () at
 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
 >> >   #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f
 "/home/lijiang/src/39317/vmcore") at diskdump.c:820
 >> >   #2  0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f
 "/home/lijiang/src/39317/vmcore") at diskdump.c:1042
 >> >   #3  0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f
 "/home/lijiang/src/39317/vmcore") at main.c:1938
 >> >   #4  0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at
 main.c:271
 >> >   (gdb) frame 1
 >> >   #1  0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f
 "/home/lijiang/src/39317/vmcore") at diskdump.c:820
 >> >   820                   memcpy(dd->dumpable_bitmap, dd->bitmap +
 bitmap_len/2,
 >> >   (gdb) p dd->dumpable_bitmap
 >> >   $1 = 0x7f8e89800010 ""
 >> >   (gdb) p dd->bitmap
 >> >   $2 = 0x7f8e87e09000 ""
 >> >   (gdb) p dd->bitmap + bitmap_len/2
 >> >   $3 = 0x7f8e88a17000 ""
 >> >   (gdb) p *(char*)(dd->bitmap+bitmap_len/2)
 >> >   $4 = 0 '\000'
 >> >   (gdb) p bitmap_len/2
 >> >   $5 = 12640256
 >> >
 >> > Let's add a sanity check for such cases to avoid causing a SIGBUS
 error.
 >> >
 >> I didn't really understand the reason why the SIGBUS error occurs. Is
 >> it because bitmap_len/2 is too large(12M), which exceeded the
 >> dd->bitmap memory range?
 >
 >
 > I did not see any limitations on the bitmap memory range.
 >
 > Looks like an unaligned memory address was accessed, please see here:
 >
 > #0  __memcpy_evex_unaligned_erms () at
 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831
 >   831             LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5),
 %VMM(6), %VMM(7))
 >
 > Maybe the vmcore is corrupted, anyway I have never seen it before this
 issue.
 >
 >>
 >> Why *(char*)(dd->bitmap+bitmap_len/2) == '\0' will be considered as
an
 >> error, do we expect a value like "0xff" here?
 >
 >
 > No.
 > If it is a null string, no need to copy it(or it may be an exceptional
 string because of corrupted data).
 >
 I still didn't understand the check here. Assume we have a bitmap like
 the following. The above code only checks if the starting byte is 0.

No, as above mentioned,  the *(char*)(dd->bitmap+bitmap_len/2) == '\0' is
checking if it is a null string, not a 0 value. They are different.

What about if there are 1s after the leading 0? Do these be considered
...
 as valid bitmap and should they be copied?
 ... 0 1 1 0 1 1 0 1 1 1 1
    ^

This is not a null string.  The value of checking
*(char*)(dd->bitmap+bitmap_len/2) == '\0' is 'false', not
'true'.

...
 ... len/2

 I understand that you want to skip the unnecessary memcpy to bypass

As I mentioned in the last email, If it is a null string('\0'), no need to
copy it. Let's differentiate between 0 and '\0', they are not the same.

Thanks
Lianbo

the SIGBUS error by setting a check condition. I'm worrying if the
...
 check may filter out those valid bitmaps.

 Thanks,
 Tao Liu

 >>
 >>
 >> I guess what we want here is to handle SIGBUS errors nicely right? why
 >> don't we add a SIGBUS handler and process it directly?
 >
 >
 > This may prevent a core dump file from being generated.
 > And users can easily report it with a core dump file, sometimes It's
 hard to reproduce for us.
 >
 > Thanks
 > Lianbo
 >
 >>
 >> Thanks,
 >> Tao Liu
 >>
 >> > With the patch:
 >> >   $ crash -s vmlinux vmcore
 >> >   crash: vmcore: not a supported file format
 >> >   ...
 >> >   Enter "crash -h" for details.
 >> >
 >> >   $ crash --osrelease vmcore
 >> >   unknown
 >> >
 >> > Reported-by: Buland Kumar Singh <bsingh(a)redhat.com&gt;
 >> > Signed-off-by: Lianbo Jiang <lijiang(a)redhat.com&gt;
 >> > ---
 >> >  diskdump.c | 6 ++++--
 >> >  1 file changed, 4 insertions(+), 2 deletions(-)
 >> >
 >> > diff --git a/diskdump.c b/diskdump.c
 >> > index 1f7118cacfc6..c31eab01aa05 100644
 >> > --- a/diskdump.c
 >> > +++ b/diskdump.c
 >> > @@ -814,10 +814,12 @@ restart:
 >> >                 madvise(dd->bitmap, bitmap_len, MADV_WILLNEED);
 >> >         }
 >> >
 >> > -       if (dump_is_partial(header))
 >> > +       if (dump_is_partial(header)) {
 >> > +               if (*(char*)(dd->bitmap + bitmap_len/2) ==
'\0')
 >> > +                       goto err;
 >> >                 memcpy(dd->dumpable_bitmap, dd->bitmap +
bitmap_len/2,
 >> >                        bitmap_len/2);
 >> > -       else
 >> > +       } else
 >> >                 memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len);
 >> >
 >> >         dd->data_offset
 >> > --
 >> > 2.45.1
 >> > --
 >> > Crash-utility mailing list -- devel(a)lists.crash-utility.osci.io
 >> > To unsubscribe send an email to
 devel-leave(a)lists.crash-utility.osci.io
 >> > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
 >> > Contribution Guidelines: https://github.com/crash-utility/crash/wiki
 >>

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

[Crash-utility] Re: [PATCH] Fix a "Bus error" issue caused by 'crash --osrelease' or crash loading