On 2023/09/07 19:54, Daisuke Hatayama (Fujitsu) wrote:
Hagio-san,
I'm now facing the issue on RHEL8 and RHEL9 that crash dump files are
corrupted although makedumpfile terminated in successful status, in
the issue which I found the issue this patch tries to fix.
I made a program to check where is corrupted in detail in such crash
dump file. The following is example:
# ./validatekdump -v vmcore
disk dump header:
signature: "KDUMP ^F"
header_version: 6
block_size: 4096
sub_hdr_size: 18
bitmap_blocks: 1056
total_ram_blocks: 0
device_blocks: 0
written_blocks: 0
kdump sub header:
phys_base: 0000000172200000
dump_level: 31
start_pfn_64: 0
end_pfn_64: 0
max_mapnr_64: 17301504
Total ram pages: 16597666
Total dumpable pages: 767664
PFN OFFSET SIZE FLAGS PAGE_FLAGS
1024 22827136 4096 00000000 0000000000000000 valid
1025 22827136 4096 00000000 0000000000000000 valid
...strip...
1102265 54518942 143 00000000 00000000033cd80d invalid
1102266 60779170 42127 00000000 00000000033cd80d invalid
1102267 61070498 38287 00000000 00000000033cd80d invalid
In my evaluation, disk dump header, kdump sub header, two bitmaps
appears consistent. The corrupted part begins in the middle of page
descriptor table. Page descriptors are corrupted in various patterns,
for example:
- offset is larger than size of the crash dump file.
- size is larger than 4096 bytes.
- size is 0.
- size is smaller than 4096 bytes but flags doesn't hold the value representing
compression.
- flags doesn't consist of the values defined by DUMP_DH_XXXXX macros.
- page_flags is not 0 although it is unused.
Corrupted part appears to reach data portion. Some crash dump file
contains valid 49489 page descriptors whose data portion is compressed
with LZO according to the flags member, but only 7 of them are
decompressed successfully.
Unfortunately, information to investigate this issue is just these
crash dump files. I don't have machines where the issue is
reproducible. So, while I'm making effort to reproduce the issue, I'm
checking known similar issues on related components where the root
cause of this issue possibly lies in, such as makedumpfile,
filesystem, disk driver, or any hardware issue.
I looked into git log but I didn't find such commits.
Also, I have never seen issues resulting in this kind of data corruption on
makedumpfile.
Then, would you tell me if you know some past or current issues on
makedumpfile that are similar to the above situation?
Hmm, I've tested makedumpfile every week with the latest upstream and
RHEL8/9 kernels on various machines, but I've also never seen such
corruption, IIRC.
Hope you find a reproducer..
Thanks,
Kazu
>
> Thanks.
> HATAYAMA, Daisuke
>
>> ________________________________________
>> From: Crash-utility <crash-utility-bounces(a)redhat.com> on behalf of
HATAYAMA Daisuke <d.hatayama(a)fujitsu.com>
>> Sent: Thursday, September 7, 2023 18:38
>> To: crash-utility(a)redhat.com
>> Subject: [Crash-utility] [PATCH] diskdump: fix read_diskdump() returns
successfully for illegal 0-size page descriptors
>>
>> read_diskdump() returns successfully for illegal 0-size page
>> descriptors. Page descriptors are illegal if their size member holds 0
>> because makedumpfile never puts 0 there because any data never result
>> in 0 byte by compression. If page descriptors hold 0 in size member,
>> it means the crash dump file is corrupted for some reason.
>>
>> The root cause of this is that sanity check of function cache_page()
>> doesn't focus on such 0-size page descriptors. Then, the 0-size page
>> descriptor is passed to pread(), pread() immediately returns 0
>> successfully because read data is 0 byte, and then read_diskdump()
>> returns successfully.
>>
>> To fix this issue, let the sanity check take into account such 0-size
>> page descriptors and read_diskdump() result in READ_ERROR.
>>
>> Signed-off-by: HATAYAMA Daisuke <d.hatayama(a)fujitsu.com>
>> ---
>> diskdump.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/diskdump.c b/diskdump.c
>> index 2c284ff..2be7cc7 100644
>> --- a/diskdump.c
>> +++ b/diskdump.c
>> @@ -1210,7 +1210,7 @@ cache_page(physaddr_t paddr)
>> return ret;
>>
>> /* sanity check */
>> - if (pd.size > block_size)
>> + if (pd.size > block_size || !pd.size)
>> return READ_ERROR;
>>
>> /* read page data */
>> --
>> 2.25.1
>>
>> --
>> Crash-utility mailing list
>> Crash-utility(a)redhat.com
>>
https://listman.redhat.com/mailman/listinfo/crash-utility
>> Contribution Guidelines:
https://github.com/crash-utility/crash/wiki