Kernel dump file access library (repost)
by Petr Tesarik
[It seems I used the wrong address; sorry for the possible double posting.]
Hi all,
during this year's SUSE HackWeek, my colleague started work on enabling
kernel core files in gdb. I realized that there would be at least four
different programs implementing read access to kernel dump files:
1. the crash utility
2. makedumpfile (when re-filtering)
3. kdumpid (my project to get kernel version from a dump file)
4. gdb-kdump (started by my colleague during HackWeek)
At this point, I felt that's too much re-inventing the wheel again and
again, so I took my current code from kdumpid and adapted it as a
library that can be used by everybody:
https://github.com/ptesarik/libkdumpfile
In its current shape, it's usable, but far from complete.
Things that work already:
- identify kdump file format
- parse meta-information from the header
- open ELF, diskdump, makedumpfile, LKCD
- read data by physical address (incl. Xen Dom0)
- read data by Xen machine address
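As a rough illustration of the first item above, identifying a dump format usually comes down to comparing the first bytes of the file against known signatures. The sketch below is not libkdumpfile code: the enum, the function names and the choice of four signatures are assumptions made for this example, and LKCD and the remaining formats are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical format tags for this sketch; not libkdumpfile's API. */
enum dump_format {
    FMT_UNKNOWN,
    FMT_ELF,        /* ELF core file: 0x7f 'E' 'L' 'F' */
    FMT_KDUMP,      /* compressed kdump: "KDUMP" padded with 3 spaces */
    FMT_DISKDUMP,   /* diskdump: "DISKDUMP" */
    FMT_FLATTENED   /* makedumpfile flattened stream: "makedumpfile" */
};

/* Return nonzero if the first siglen bytes of buf equal sig. */
static int has_sig(const unsigned char *buf, size_t len,
                   const char *sig, size_t siglen)
{
    return len >= siglen && memcmp(buf, sig, siglen) == 0;
}

/* Classify a file by the signature at its very beginning. */
enum dump_format identify_format(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (has_sig(buf, len, "\177ELF", 4))
        return FMT_ELF;
    if (has_sig(buf, len, "KDUMP   ", 8))
        return FMT_KDUMP;
    if (has_sig(buf, len, "DISKDUMP", 8))
        return FMT_DISKDUMP;
    if (has_sig(buf, len, "makedumpfile", 12))
        return FMT_FLATTENED;
    return FMT_UNKNOWN;
}
```

A caller would read the first few bytes of the file and pass them in. A flattened file then still needs to be rearranged before the inner format can be identified the same way.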
Things still on my TODO list:
- more formats: sadump, kvmdump, libvirt, xc_core, xc_save
- determine phys_base in ELF files
- determine kernel release if not found in headers
Ideally, I would like to replace all current implementations with this
library, so if a new file format appears, or a new feature is added to
one of the files, it can be immediately used by all kdump-related tools.
Please let me know what you think.
Oh, and if you're developing such a tool, let me know which features
should be added.
Regards,
Petr Tesarik
[PATCH] s390: support irq command via generic_dump_irq
by Sebastian Ott
Hi,
here is a simple patch that adds rudimentary support for the irq command on s390.
Nothing special like irq statistics, just the plain list of irqs. Also
this will only work on recent kernels. Old kernels (without
GENERIC_HARDIRQ support) will print "cannot determine number of IRQs".
Regards,
Sebastian
From aa13aff5450686ac4438d771596e0faa041aa454 Mon Sep 17 00:00:00 2001
From: Sebastian Ott <sebott(a)linux.vnet.ibm.com>
Date: Fri, 14 Nov 2014 13:52:54 +0100
Subject: [PATCH] s390: support irq command via generic_dump_irq
Signed-off-by: Sebastian Ott <sebott(a)linux.vnet.ibm.com>
---
kernel.c | 6 ------
s390.c | 25 ++++++++++++-------------
s390x.c | 24 +++++++++++-------------
3 files changed, 23 insertions(+), 32 deletions(-)
diff --git a/kernel.c b/kernel.c
index 1cb0967..da1e48e 100644
--- a/kernel.c
+++ b/kernel.c
@@ -5575,9 +5575,6 @@ cmd_irq(void)
return;
case 'u':
- if (machine_type("S390") || machine_type("S390X"))
- command_not_supported();
-
pc->curcmd_flags |= IRQ_IN_USE;
if (kernel_symbol_exists("no_irq_chip"))
pc->curcmd_private = (ulonglong)symbol_value("no_irq_chip");
@@ -5633,9 +5630,6 @@ cmd_irq(void)
if (argerrs)
cmd_usage(pc->curcmd, SYNOPSIS);
- if (machine_type("S390") || machine_type("S390X"))
- command_not_supported();
-
if ((nr_irqs = machdep->nr_irqs) == 0)
error(FATAL, "cannot determine number of IRQs\n");
diff --git a/s390.c b/s390.c
index 45da7c4..7740fe3 100644
--- a/s390.c
+++ b/s390.c
@@ -57,7 +57,6 @@ static int s390_translate_pte(ulong, void *, ulonglong);
static ulong s390_processor_speed(void);
static int s390_eframe_search(struct bt_info *);
static void s390_back_trace_cmd(struct bt_info *);
-static void s390_dump_irq(int);
static void s390_get_stack_frame(struct bt_info *, ulong *, ulong *);
static int s390_dis_filter(ulong, char *, unsigned int);
static void s390_cmd_mach(void);
@@ -146,9 +145,18 @@ s390_init(int when)
break;
case POST_GDB:
- machdep->nr_irqs = 0; /* TBD */
+
+ if (symbol_exists("irq_desc"))
+ ARRAY_LENGTH_INIT(machdep->nr_irqs, irq_desc,
+ "irq_desc", NULL, 0);
+ else if (kernel_symbol_exists("nr_irqs"))
+ get_symbol_data("nr_irqs", sizeof(unsigned int),
+ &machdep->nr_irqs);
+ else
+ machdep->nr_irqs = 0;
+
machdep->vmalloc_start = s390_vmalloc_start;
- machdep->dump_irq = s390_dump_irq;
+ machdep->dump_irq = generic_dump_irq;
if (!machdep->hz)
machdep->hz = HZ;
machdep->section_size_bits = _SECTION_SIZE_BITS;
@@ -194,7 +202,7 @@ s390_dump_machdep_table(ulong arg)
fprintf(fp, " uvtop: s390_uvtop()\n");
fprintf(fp, " kvtop: s390_kvtop()\n");
fprintf(fp, " get_task_pgd: s390_get_task_pgd()\n");
- fprintf(fp, " dump_irq: s390_dump_irq()\n");
+ fprintf(fp, " dump_irq: generic_dump_irq()\n");
fprintf(fp, " get_stack_frame: s390_get_stack_frame()\n");
fprintf(fp, " get_stackbase: generic_get_stackbase()\n");
fprintf(fp, " get_stacktop: generic_get_stacktop()\n");
@@ -954,15 +962,6 @@ s390_get_stack_frame(struct bt_info *bt, ulong *eip, ulong *esp)
}
/*
- * cmd_irq() is not implemented for s390.
- */
-static void
-s390_dump_irq(int irq)
-{
- error(FATAL, "s390_dump_irq: TBD\n");
-}
-
-/*
* Filter disassembly output if the output radix is not gdb's default 10
*/
static int
diff --git a/s390x.c b/s390x.c
index 5bd7a81..7d1310f 100644
--- a/s390x.c
+++ b/s390x.c
@@ -104,7 +104,6 @@ static int s390x_translate_pte(ulong, void *, ulonglong);
static ulong s390x_processor_speed(void);
static int s390x_eframe_search(struct bt_info *);
static void s390x_back_trace_cmd(struct bt_info *);
-static void s390x_dump_irq(int);
static void s390x_get_stack_frame(struct bt_info *, ulong *, ulong *);
static int s390x_dis_filter(ulong, char *, unsigned int);
static void s390x_cmd_mach(void);
@@ -412,9 +411,17 @@ s390x_init(int when)
break;
case POST_GDB:
- machdep->nr_irqs = 0; /* TBD */
+ if (symbol_exists("irq_desc"))
+ ARRAY_LENGTH_INIT(machdep->nr_irqs, irq_desc,
+ "irq_desc", NULL, 0);
+ else if (kernel_symbol_exists("nr_irqs"))
+ get_symbol_data("nr_irqs", sizeof(unsigned int),
+ &machdep->nr_irqs);
+ else
+ machdep->nr_irqs = 0;
+
machdep->vmalloc_start = s390x_vmalloc_start;
- machdep->dump_irq = s390x_dump_irq;
+ machdep->dump_irq = generic_dump_irq;
if (!machdep->hz)
machdep->hz = HZ;
machdep->section_size_bits = _SECTION_SIZE_BITS;
@@ -462,7 +469,7 @@ s390x_dump_machdep_table(ulong arg)
fprintf(fp, " uvtop: s390x_uvtop()\n");
fprintf(fp, " kvtop: s390x_kvtop()\n");
fprintf(fp, " get_task_pgd: s390x_get_task_pgd()\n");
- fprintf(fp, " dump_irq: s390x_dump_irq()\n");
+ fprintf(fp, " dump_irq: generic_dump_irq()\n");
fprintf(fp, " get_stack_frame: s390x_get_stack_frame()\n");
fprintf(fp, " get_stackbase: generic_get_stackbase()\n");
fprintf(fp, " get_stacktop: generic_get_stacktop()\n");
@@ -1413,15 +1420,6 @@ s390x_get_stack_frame(struct bt_info *bt, ulong *eip, ulong *esp)
}
/*
- * cmd_irq() is not implemented for s390x.
- */
-static void
-s390x_dump_irq(int irq)
-{
- error(FATAL, "s390x_dump_irq: TBD\n");
-}
-
-/*
* Filter disassembly output if the output radix is not gdb's default 10
*/
static int
--
1.8.5.5
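The order in which the patch tries to determine machdep->nr_irqs can be restated on its own. This is only a sketch of the decision logic: struct irq_symbols and determine_nr_irqs() are hypothetical names invented here, standing in for what symbol_exists(), ARRAY_LENGTH_INIT() and get_symbol_data() do inside crash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of what the symbol lookup found. In crash the same
 * questions are answered by the helpers used in the patch above. */
struct irq_symbols {
    int has_irq_desc;            /* "irq_desc" symbol present? */
    unsigned int irq_desc_len;   /* its array length, if present */
    int has_nr_irqs;             /* "nr_irqs" variable present? */
    unsigned int nr_irqs_value;  /* its value, if present */
};

/* Prefer the irq_desc[] array length, then the nr_irqs variable, else 0. */
unsigned int determine_nr_irqs(const struct irq_symbols *s)
{
    assert(s != NULL);
    if (s->has_irq_desc)
        return s->irq_desc_len;
    if (s->has_nr_irqs)
        return s->nr_irqs_value;
    /* With 0, cmd_irq() reports "cannot determine number of IRQs". */
    return 0;
}
```

The final fallback is what old kernels without GENERIC_HARDIRQ support end up with, which matches the error message mentioned in the patch description.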
[ANNOUNCE] crash version 7.0.9 is available
by Dave Anderson
Download from: http://people.redhat.com/anderson
or
https://github.com/crash-utility/crash/releases
The master branch serves as a development branch that will contain all
patches that are queued for the next release:
$ git clone git://github.com/crash-utility/crash.git
Changelog:
- Fix the CPU timer and clock comparator output for the "bt -a" command
on S390X machines. The output of CPU timer and clock comparator has
always been incorrect because:
- We added S390X_WORD_SIZE (8) instead of 4 to get the second word
- We did not left shift the clock comparator by 8
The fix reads the complete 64-bit values and shifts the clock
comparator correctly.
(holzheu(a)linux.vnet.ibm.com)
- Add "/lib/modules/<version>/build" to the list of directories that
are searched for the currently-running kernel on live systems. This
will automatically locate the vmlinux namelist for kernels that were
locally installed with "make modules_install install".
(lrintel(a)redhat.com)
- Addressed 3 Coverity Scan issues:
(1) task.c: initialize the "curr" and "curr_my_q" variables in the
dump_tasks_in_task_group_cfs_rq() function.
(2) ramdump.c: make the "rd" and "len" return values from read()
and write() calls in write_elf() to be ssize_t types.
(3) cmdline.c: make the parsed PATH string buffer equal to the size
of the PATH string + 1 to prevent a possible buffer overflow
when a command line starts with a "!".
(anderson(a)redhat.com)
- Fix for the one-time (dumpfile), or as-required (live system),
gathering of tasks from the kernel pid_hash[] in 2.6.24 and later
kernels. Without the patch, if an entry in a pid_hash[] chain is
not related to the "init_pid_ns" pid_namespace structure, any
remaining entries in the hlist chain are skipped.
(vvs(a)parallels.com)
- Update the "extensions/snap.mk" file to allow the "snap.so" extension
module to be built outside of a crash source tree on a ppc64le PPC64
little-endian host. Without the patch, "make -f snap.mk" would fail
to compile, indicating "gcc: error: macro name missing after '-D'"
(anderson(a)redhat.com)
- Improve the method for determining whether a 32-bit ARM vmlinux is
an LPAE enabled kernel by first checking whether CONFIG_ARM_LPAE
exists in the vmcoreinfo data, and if it does not, by then checking
whether the next higher symbol above "swapper_pg_dir" is 0x5000 bytes
higher in value.
(sdu.liu(a)huawei.com)
- Fix "defs.h" for building extension modules outside of the crash
utility source tree on PPC and PPC64 machines. Without the patch,
both PPC and PPC64 will get #define'd if the extension module build
procedure does not #define one or the other, which in turn causes
multiple conflicting declarations.
(anderson(a)redhat.com)
- Fix for the "ps" command performance degradation patch that was
introduced in crash-7.0.8. Without this patch, it is possible that
the "ps" command may fail prematurely with the error message
"ps: bsearch for tgid failed: task: <address> tgid: <number>"
when running on a live system or against a "live" dumpfile.
(panfy.fnst(a)cn.fujitsu.com)
- Set the 32-bit ARM HZ value to a default value of 100 if the kernel
was not configured with CONFIG_IKCONFIG. Without the patch, the
initial system banner and the "sys" command show "UPTIME: (cannot
calculate: unknown HZ value)", the "ps -t" option shows "RUN TIME:
(cannot calculate: unknown HZ value)", and the "timer -r" option
kills the crash session with a floating point exception.
(hukeping(a)huawei.com)
- Fix the error message displayed if the vmlinux or vmcore file is
not the same endian as the crash utility binary. Without the patch
the filename is shown with the incorrect/opposite endian type.
(hukeping(a)huawei.com)
- Update the "ps" command's "ST" task state display to recognize the
TASK_PARKED state in Linux 3.9 and later kernels. Without the patch,
the command's "ST" column entry for parked tasks shows "??". The
state column will now show "PA", and the foreach command will accept
"PA" as a "state" argument.
(anderson(a)redhat.com)
- Fortify the protection against the use of an invalid/corrupted
CONFIG_SLAB kmem_cache per-cpu array_cache.limit value during
session initialization. In a recently seen vmcore, several of the
array_cache.limit values were corrupted such that they were stored
as negative values, which in turn caused the "kmem -[sS]" options
to fail immediately with a dump of the internal memory buffer
allocation statistics and the error message "kmem: cannot allocate
any more memory!".
(anderson(a)redhat.com)
- Implement a new "offline" internal crash variable that can be set to
either "show" (the default) or "hide". When set to "hide", certain
command output associated with offline cpus will be hidden from view,
and the output will indicate that the cpu is "[OFFLINE]". The new
variable can be set during invocation on the crash command line via
the option "--offline [show|hide]". During runtime, or in a .crashrc
or other crash input file, the variable can be set by entering
"set offline [show|hide]". The commands or options that are affected
when the variable is set to "hide" are as follows:
o On X86_64 machines, the "bt -E" option will not search exception
stacks associated with offline cpus.
o On X86_64 machines, the "mach" command will append "[OFFLINE]"
to the addresses of IRQ and exception stacks associated with
offline cpus.
o On X86_64 machines, the "mach -c" command will not display the
cpuinfo_x86 data structure associated with offline cpus.
o The "help -r" option has been fixed so as to not attempt to
display register sets of offline cpus from ELF kdump vmcores,
compressed kdump vmcores, and ELF kdump clones created by
"virsh dump --memory-only".
o The "bt -c" option will not accept an offline cpu number.
o The "set -c" option will not accept an offline cpu number.
o The "irq -s" option will not display statistics associated with
offline cpus.
o The "timer" command will not display hrtimer data associated
with offline cpus.
o The "timer -r" option will not display hrtimer data associated
with offline cpus.
o The "ptov" command will append "[OFFLINE]" when translating a
per-cpu address offset to a virtual address of an offline cpu.
o The "kmem -o" option will append "[OFFLINE]" to the base per-cpu
virtual address of an offline cpu.
o The "kmem -S" option in CONFIG_SLUB kernels will not display
per-cpu data associated with offline cpus.
o When a per-cpu address reference is passed to the "struct"
command, the data structure will not be displayed for offline
cpus.
o When a per-cpu symbol and cpu reference is passed to the "p"
command, the data will not be displayed for offline cpus.
o When the "ps -[l|m]" option is passed the optional "-C [cpus]"
option, the tasks queued on offline cpus are not shown.
o The "runq" command and the "runq [-t/-m/-g/-d]" options will not
display runqueue data for offline cpus.
o The "ps" command will replace the ">" active task indicator with
a "-" for offline cpus.
The initial system information banner and the "sys" command will
display the total number of cpus as before, but will append the count
of offline cpus. Lastly, a fix has been made for the initialization
time determination of the maximum number of per-cpu objects queued
in a CONFIG_SLAB kmem_cache so as to continue checking all cpus
higher than the first offline cpu. These changes in behavior are not
dependent upon the setting of the crash "offline" variable.
(qiaonuohan(a)cn.fujitsu.com)
- Adjustment to the "offline" patch-set so that the initial system
banner, the "sys" command, and the X86_64 "mach" command only
show the "OFFLINE" cpu count if there are actually offline cpus.
(anderson(a)redhat.com)
- Make the "bt -E" option conform to a "-c cpu(s)" specification when
the two options are used together. Without the patch, "bt -E"
ignores a cpu specifier.
(anderson(a)redhat.com)
- Fix for the determination of the cpu count on 32-bit ARM machines.
Without the patch, if certain patterns of cpus are offline, the count
may be too small, causing cpu-dependent commands to not recognize
online cpus.
(Jan.Karlsson(a)sonymobile.com, anderson(a)redhat.com)
- Fix for a missing exception frame dump by the X86_64 "bt" command
when an IRQ is received while a task is running on its per-cpu
interrupt stack with interrupts enabled.
(anderson(a)redhat.com)
- Fix for the determination of the cpu count on ARM64 machines.
Without the patch, if certain patterns of cpus are offline, the count
may be too small, causing cpu-dependent commands to not recognize
online cpus.
(Jan.Karlsson(a)sonymobile.com, anderson(a)redhat.com)
- Fix for a possible SIGSEGV generated during session initialization
while "please wait... (determining panic task)" is being displayed.
This was caused by a patch introduced in crash-7.0.8, and can only
happen when analyzing dumpfiles whose header does not contain the
requisite information to determine the panic task and the active
tasks do not have any crash-related traces in their kernel stacks.
It should be noted that the SIGSEGV can be avoided by entering
"--no_panic" on the crash command line.
(anderson(a)redhat.com)
- Fix for a SIGSEGV generated by the "bt -a" or "help -r" commands
if the NT_PRSTATUS notes in a compressed kdump are invalid/corrupt.
If all cpus are online but the dumpfile initialization that cycles
through the NT_PRSTATUS notes does not find exactly one note per
cpu, then the register contents in those notes should not be used.
(anderson(a)redhat.com)
- Fix for data access from "split" compressed kdump dumpfiles. Without
the patch, if a dumpfile read targets physical memory in the first
memory page stored in the second or later sequential split dumpfile,
incorrect data will be returned.
(qiaonuohan(a)cn.fujitsu.com)
- Correction of the copyright and authorship of ramdump.c.
(oza(a)broadcom.com)
- Added recognition of the new DUMP_DH_COMPRESSED_INCOMPLETE flag in
the header of compressed kdumps, and the new DUMP_ELF_INCOMPLETE flag
in the header of ELF kdumps. If the makedumpfile(8) facility fails
to complete the creation of compressed or ELF kdump vmcore files
due to ENOSPC or other error, it will mark the vmcore as incomplete.
If either flag is set, the crash utility will issue a warning that
the dumpfile is known to be incomplete during initialization, just
prior to the system banner display. When reads are attempted on
missing data, a read error will be returned. As an alternative,
zero-filled data will be returned if the "--zero_excluded" command
line flag is used, or the "zero_excluded" runtime variable is set
to "on". In either case, the read errors or zero-filled memory
may cause the crash session to fail entirely, cause commands to
fail, or may result in other unpredictable runtime behavior.
(anderson(a)redhat.com, zhouwj-fnst(a)cn.fujitsu.com)
- If a kernel has been configured with CONFIG_DEBUG_INFO_REDUCED, then
the crash utility will fail to initialize, typically with a message
indicating "no debugging data available". However, it has been
reported (on a 32-bit ARM system) that the initialization sequence
continued on beyond that message point, and the session failed later
on with the message "neither runqueue nor rq structures exist". As
an aid to understanding why the session failed, if the target kernel
is configured with CONFIG_IKCONFIG, and CONFIG_DEBUG_INFO_REDUCED has
been set to "y", a relevant warning message will be displayed.
(anderson(a)redhat.com)
- Implemented support for this Linux 3.18 commit for kernels that are
configured with CONFIG_SLAB:
commit bf0dea23a9c094ae869a88bb694fbe966671bf6d
mm/slab: use percpu allocator for cpu cache
The commit above redesigned the kmem_cache.array_cache[] from a
hardwired array to a per-cpu pointer referencing external array_cache
structures. Without the patch, the crash session would fail during
initialization with the message "crash: cannot resolve cache_cache".
Note that it could be worked around by using the "--no_kmem_cache"
command line option, with a resulting loss of functionality for
commands requiring slab-related data.
(anderson(a)redhat.com)
- Implemented a new "sys -t" option that displays kernel taint
information. If the "tainted_mask" symbol exists, the option will
show its hexadecimal value and translate each bit set to the symbolic
letter of the taint type. On kernels prior to 2.6.28 which had the
"tainted" symbol, only its hexadecimal value is shown. The relevant
kernel sources should be consulted for the meaning of the letter(s)
or hexadecimal bit value(s).
(anderson(a)redhat.com)
- Cosmetic fix for the "help -[n|D]" translation of the bitmap contents
of the kdump_sub_header.dump_level flag in compressed kdump dumpfiles.
(anderson(a)redhat.com)
- Fix for the support of compressed kdump clones created with the KVM
"virsh dump --memory-only --format <compression-type>" command,
where the compression-type is either "kdump-zlib", "kdump-lzo" or
"kdump-snappy". Without the patch, if an x86_64 guest kernel was loaded
with a non-zero "phys_base", the "--machdep phys_base=<offset>" command
line option was required as a workaround or the crash session would fail
with the warning message "WARNING: cannot read linux_banner string"
followed by the fatal error message "crash: vmlinux and <dumpfile name>
do not match!".
(anderson(a)redhat.com)
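To illustrate the "sys -t" entry in the changelog above: translating a taint mask means emitting one letter per set bit. The following is a minimal sketch, not the crash implementation. The function name is hypothetical, and the letter table for bits 0 to 12 reflects my understanding of the kernel's taint flags; it is an illustrative subset, so the kernel sources for the version at hand remain the authority.

```c
#include <assert.h>
#include <stddef.h>

/* Letters for taint bits 0..12 (assumed from the kernel's taint flags;
 * later kernels define more bits than are listed here). */
static const char taint_letters[] = "PFSRMBUDAWCIO";

/* Write the letter of each set bit into out (always NUL-terminated).
 * Bits without a letter in the table are skipped.
 * Returns the number of letters written. */
size_t decode_taint(unsigned long mask, char *out, size_t outlen)
{
    size_t n = 0;
    size_t bit;

    assert(out != NULL && outlen > 0);
    for (bit = 0; bit < sizeof(taint_letters) - 1; bit++) {
        if ((mask & (1UL << bit)) && n + 1 < outlen)
            out[n++] = taint_letters[bit];
    }
    out[n] = '\0';
    return n;
}
```

With this table a mask of 0x1001 has bits 0 and 12 set and decodes to the two letters "P" and "O".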
Re: [Crash-utility] uniquely identifying KDUMP files that originate from QEMU
by Dave Anderson
----- Original Message -----
> From: HATAYAMA Daisuke <d.hatayama(a)jp.fujitsu.com>
> To: ptesarik(a)suse.cz
> Cc: lersek(a)redhat.com, kexec(a)lists.infradead.org
> Subject: Re: uniquely identifying KDUMP files that originate from QEMU
> Message-ID:
> <20141112.120838.303682123986142686.d.hatayama(a)jp.fujitsu.com>
> Content-Type: Text/Plain; charset=us-ascii
>
> From: Petr Tesarik <ptesarik(a)suse.cz>
> Subject: Re: uniquely identifying KDUMP files that originate from QEMU
> Date: Tue, 11 Nov 2014 13:09:13 +0100
>
> > On Tue, 11 Nov 2014 12:22:52 +0100
> > Laszlo Ersek <lersek(a)redhat.com> wrote:
> >
> >> (Note: I'm not subscribed to either qemu-devel or the kexec list; please
> >> keep me CC'd.)
> >>
> >> QEMU is able to dump the guest's memory in KDUMP format (kdump-zlib,
> >> kdump-lzo, kdump-snappy) with the "dump-guest-memory" QMP command.
> >>
> >> The resultant vmcore is usually analyzed with the "crash" utility.
> >>
> >> The original tool producing such files is kdump. Unlike the procedure
> >> performed by QEMU, kdump runs from *within* the guest (under a kexec'd
> >> kdump kernel), and has more information about the original guest kernel
> >> state (which is being dumped) than QEMU. To QEMU, the guest kernel state
> >> is opaque.
> >>
> >> For this reason, the kdump preparation logic in QEMU hardcodes a number
> >> of fields in the kdump header. The direct issue is the "phys_base"
> >> field. Refer to dump.c, functions create_header32(), create_header64(),
> >> and "include/sysemu/dump.h", macro PHYS_BASE (with the replacement text
> >> "0").
> >>
> >> http://git.qemu.org/?p=qemu.git;a=blob;f=dump.c;h=9c7dad8f865af3b778589dd...
> >>
> >> http://git.qemu.org/?p=qemu.git;a=blob;f=include/sysemu/dump.h;h=7e4ec5c7...
> >>
> >> This works in most cases, because the guest Linux kernel indeed tends to
> >> be loaded at guest-phys address 0. However, when the guest Linux kernel
> >> is booted on top of OVMF (which has a somewhat unusual UEFI memory map),
> >> then the guest Linux kernel is loaded at 16MB, thereby getting out of
> >> sync with the phys_base=0 setting visible in the KDUMP header.
> >>
> >> This trips up the "crash" utility.
> >>
> >> Dave worked around the issue in "crash" for ELF format dumps -- "crash"
> >> can identify QEMU as the originator of the vmcore by finding the QEMU
> >> notes in the ELF vmcore. If those are present, then "crash" employs a
> >> heuristic, probing for a phys_base up to 32MB, in 1MB steps.
> >>
> >> Alas, the QEMU notes are not present in the KDUMP-format vmcores that
> >> QEMU produces (they cannot be),
> >
> > Why? Since KDUMP format version 4, the complete ELF notes can be stored
> > in the file (see offset_note, size_note fields in the sub-header).
> >
>
> Yes, the QEMU notes are present in the kdump-compressed format. But
> phys_base cannot be calculated from the qemu side alone. We cannot do
> more than what the crash utility already does as a workaround. So the
> phys_base value in the kdump sub-header is currently designed to be 0.
>
> Anyway, phys_base is kernel information. To make it available for qemu
> side, there's need to prepare a mechanism for qemu to have any access
> to it.
>
> One ad-hoc but simple way is to put phys_base value as part of
> VMCOREINFO note information on kernel.
>
> Although there has already been a similar one in VMCOREINFO, like
>
> arch/x86/kernel/
> ==
> void arch_crash_save_vmcoreinfo(void)
> {
> VMCOREINFO_SYMBOL(phys_base); <---- This
> VMCOREINFO_SYMBOL(init_level4_pgt);
>
> ...
> ==
>
> this is meaningless, because this value is a virtual address assigned to
> phys_base symbol. To refer to the value of phys_base itself, we need
> the phys_base value we are about to get now.
>
> So, instead, if we change this to save the value, not value of symbol
> phys_base, we can get phys_base from the VMCOREINFO.
>
> The VMCOREINFO consists simply of text strings, so it's easy to search
> a vmcore for it, e.g. using strings and grep like this:
>
> $ strings vmcore-3.10.0-121.el7.x86_64 | grep -E ".*VMCOREINFO.*" -A 100
> VMCOREINFO
> OSRELEASE=3.10.0-121.el7.x86_64
> PAGESIZE=4096
> ...
> SYMBOL(phys_base)=ffffffff818e5010 <-- though this is address of phys_base
> now...
> SYMBOL(init_level4_pgt)=ffffffff818de000
> SYMBOL(node_data)=ffffffff819f1cc0
> LENGTH(node_data)=1024
> CRASHTIME=1399460394
> ...
>
> This should also be useful to get phys_base of 2nd kernel, which is
> inherently relocated kernel from a vmcore generated using qemu dump.
>
> This is far from well-designed from qemu's point of view, but it would
> be manually easier to get phys_base than now.
>
> Obviously, the VMCOREINFO is available only if CONFIG_KEXEC is
> enabled. Other users cannot use this.
>
> --
> Thanks.
> HATAYAMA, Daisuke
I agree that the actual value of phys_base should be included in the vmcoreinfo.
However, it won't help in this case because the vmcoreinfo data is not
copied into the compressed dumpfile header. The offset_vmcoreinfo and
size_vmcoreinfo fields are zero.
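For dumpfiles that do carry vmcoreinfo, pulling a value out of it is simple, because the data is plain KEY=value text, one entry per line, as in the strings/grep output quoted above. A minimal sketch follows; vmcoreinfo_hex() is a hypothetical helper written for this example and not a crash or makedumpfile function. Keep in mind that SYMBOL(phys_base) currently holds the address of the symbol, not the phys_base value itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Look for "key=" at the start of a line in a newline-separated VMCOREINFO
 * text block and parse the value as hexadecimal.
 * Returns 1 on success, 0 if the key is missing or has no hex value. */
int vmcoreinfo_hex(const char *info, const char *key, unsigned long long *val)
{
    size_t klen;
    const char *line = info;

    assert(info != NULL && key != NULL && val != NULL);
    klen = strlen(key);
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == '=') {
            const char *start = line + klen + 1;
            char *end;

            *val = strtoull(start, &end, 16);
            return end != start;
        }
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return 0;
}
```

Called with the key "SYMBOL(phys_base)" on the text quoted above, it would yield ffffffff818e5010, which is exactly the virtual address that the quoted message says is not useful on its own.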
Here's an example header dump of a QEMU-generated dumpfile:
crash> help -n
makedumpfile header:
signature: "makedumpfile"
type: 1
version: 1
all_flat_data:
num_array: 18695
array: 7f484b760010
file_size: 0
diskdump_data:
filename: vmcore.ovmf.rhel7.kdump-snappy
flags: c6 (KDUMP_CMPRS_LOCAL|ERROR_EXCLUDED|LZO_SUPPORTED|SNAPPY_SUPPORTED) [FLAT]
dfd: 3
ofp: 3e441b1260
machine_type: 62 (EM_X86_64)
header: 1a68fe0
signature: "KDUMP "
header_version: 6
utsname:
sysname:
nodename:
release:
version:
machine: x86_64
domainname:
timestamp:
tv_sec: 0
tv_usec: 0
status: 4 (DUMP_DH_COMPRESSED_SNAPPY)
block_size: 4096
sub_hdr_size: 1
bitmap_blocks: 76
max_mapnr: 1245184
total_ram_blocks: 0
device_blocks: 0
written_blocks: 0
current_cpu: 0
nr_cpus: 4
tasks[nr_cpus]: 0
0
0
0
sub_header: 0 (n/a)
sub_header_kdump: 1a69ff0
phys_base: 0
dump_level: 1 (0x1) (DUMP_EXCLUDE_ZERO)
split: 0
start_pfn: (unused)
end_pfn: (unused)
offset_vmcoreinfo: 0 (0x0)
size_vmcoreinfo: 0 (0x0)
offset_note: 4200 (0x1068)
size_note: 3232 (0xca0)
num_prstatus_notes: 4
notes_buf: 1a6b000
notes[0]: 1a6b000
notes[1]: 1a6b164
notes[2]: 1a6b2c8
notes[3]: 1a6b42c
NT_PRSTATUS_offset: 1068
11cc
1330
1494
offset_eraseinfo: 0 (0x0)
size_eraseinfo: 0 (0x0)
start_pfn_64: (unused)
end_pfn_64: (unused)
max_mapnr_64: 1245184 (0x130000)
data_offset: 4e000
block_size: 4096
block_shift: 12
bitmap: 7f484b713010
bitmap_len: 311296
max_mapnr: 1245184 (0x130000)
dumpable_bitmap: 7f484b6c6010
byte: 0
bit: 0
compressed_page: 1a8c660
curbufptr: 1a7f650
...
Note that QEMU does add self-generated register dumps above, but the special
"QEMU" note that is added to ELF kdumps is not included.
Note also that the kernel version information is left zero-filled.
In any case, if either a QEMU note or a diskdump.data flag were added, I would
be more than happy.
Dave
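For readers following the thread, the ELF-side heuristic mentioned in the quoted text (probing for a phys_base up to 32MB, in 1MB steps) could look roughly like the sketch below. It is not the code in crash. The callback type, the in-memory reader and the acceptance test (the bytes at linux_banner start with "Linux version ") are assumptions made for this illustration; only the x86_64 translation phys = virt - __START_KERNEL_map + phys_base is taken as given.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL  /* x86_64 kernel text mapping */
#define MB (1024ULL * 1024ULL)

/* Hypothetical callback: copy len bytes at physical address paddr into buf
 * and return 0 on success. A real tool would read from the dumpfile. */
typedef int (*read_phys_fn)(uint64_t paddr, void *buf, size_t len, void *ctx);

/* Try phys_base = 0, 1MB, ..., 32MB and accept the first candidate for which
 * the bytes at linux_banner's translated address look like a kernel banner.
 * Returns 1 and stores the result in *phys_base, or 0 if nothing matched. */
int probe_phys_base(uint64_t banner_vaddr, read_phys_fn rd, void *ctx,
                    uint64_t *phys_base)
{
    static const char prefix[] = "Linux version ";
    char buf[sizeof(prefix) - 1];
    uint64_t cand;

    assert(rd != NULL && phys_base != NULL);
    assert(banner_vaddr >= START_KERNEL_MAP);
    for (cand = 0; cand <= 32 * MB; cand += MB) {
        uint64_t paddr = banner_vaddr - START_KERNEL_MAP + cand;

        if (rd(paddr, buf, sizeof(buf), ctx) == 0 &&
            memcmp(buf, prefix, sizeof(buf)) == 0) {
            *phys_base = cand;
            return 1;
        }
    }
    return 0;
}

/* Minimal reader over one in-memory region, for demonstration only. */
struct flat_mem {
    const unsigned char *data;
    uint64_t base;   /* physical address of data[0] */
    size_t size;
};

int flat_read(uint64_t paddr, void *buf, size_t len, void *ctx)
{
    const struct flat_mem *m = ctx;

    if (paddr < m->base || len > m->size || paddr - m->base > m->size - len)
        return -1;
    memcpy(buf, m->data + (paddr - m->base), len);
    return 0;
}
```

In the OVMF case from the thread the kernel sits at 16MB, so a probe of this kind would stop at the candidate 16MB, provided the dump actually contains the banner page.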
Re: [Crash-utility] uniquely identifying KDUMP files that originate from QEMU
by Laszlo Ersek
adding back a few CC's because this discussion is useful
On 11/12/14 19:43, Petr Tesarik wrote:
> On Wed, 12 Nov 2014 15:50:32 +0100
> Laszlo Ersek <lersek(a)redhat.com> wrote:
>
>> On 11/12/14 09:04, Petr Tesarik wrote:
>>> On Wed, 12 Nov 2014 12:08:38 +0900 (JST)
>>> HATAYAMA Daisuke <d.hatayama(a)jp.fujitsu.com> wrote:
>>
>>>> Anyway, phys_base is kernel information. To make it available for qemu
>>>> side, there's need to prepare a mechanism for qemu to have any access
>>>> to it.
>>>
>>> Yes. I wonder if you can have access without some sort of co-operation
>>> from the guest kernel itself. I guess not.
>>
>> Propagating any kind of additional information from the guest kernel
>> (which is unprivileged and potentially malicious) to the host-side qemu
>> process (which is by definition more privileged, although still confined
>> by various measures) is something we'd explicitly like to avoid.
>>
>> Think of it like this. I throw a physical box at you, running Linux,
>> that has frozen in time. Can "crash" work with nothing else but the
>> contents of the memory, and information about the CPUs?
>
> If only you could save the _complete_ state of the CPU... For example
> the content of CR3 would be quite useful.
(1) CR3 is already saved, in both the ELF and the kdump compressed formats.
- ELF case:
qmp_dump_guest_memory() [dump.c]
create_vmcore()
dump_begin()
write_elf64_notes()
loop from 1 to #vcpu:
cpu_write_elf64_note() [qom/cpu.c]
x86_64_write_elf64_note() [target-i386/arch_dump.c]
writes "CORE"
loop from 1 to #vcpu:
cpu_write_elf64_qemunote() [qom/cpu.c]
x86_cpu_write_elf64_qemunote() [target-i386/arch_dump.c]
cpu_write_qemu_note()
qemu_get_cpustate()
s->cr[3] = env->cr[3]; <---------- here
writes "QEMU"
Hence, the information is part of the QEMU note.
- kdump case:
qmp_dump_guest_memory() [dump.c]
create_kdump_vmcore()
write_dump_header()
create_header64()
write_elf64_notes()
[... same as above ...]
The trick here is that the note-writer functions use a callback function
for actually outputting the data. So while in the ELF case the stuff
goes directly to a file, in the kdump case the notes are first saved in
a memory buffer, and then later saved in the file at offset
KdumpSubHeader64.offset_note. (... Which is then represented in the
flattened file format of course.)
So, the information is there in both cases.
(2) Dave -- this just made me realize that the QEMU note is *already*
there in the kdump file as well; pointed-to by
KdumpSubHeader64.offset_note, for a length of KdumpSubHeader64.note_size.
From your other email
<http://thread.gmane.org/gmane.linux.kernel.kexec/12787/focus=12797>:
> sub_header_kdump: 1c9cff0
> phys_base: 0
> dump_level: 1 (0x1) (DUMP_EXCLUDE_ZERO)
> split: 0
> start_pfn: (unused)
> end_pfn: (unused)
> offset_vmcoreinfo: 0 (0x0)
> size_vmcoreinfo: 0 (0x0)
> offset_note: 4200 (0x1068) <----------- here
> size_note: 3232 (0xca0) <-----------
> num_prstatus_notes: 4
> notes_buf: 1c9e000
> notes[0]: 1c9e000
> notes[1]: 1c9e164
> notes[2]: 1c9e2c8
> notes[3]: 1c9e42c
> NT_PRSTATUS_offset: 1068
> 11cc
> 1330
> 1494
> offset_eraseinfo: 0 (0x0)
> size_eraseinfo: 0 (0x0)
> start_pfn_64: (unused)
> end_pfn_64: (unused)
> max_mapnr_64: 1245184 (0x130000)
Can you fetch that in "crash"? If you can, then there's nothing to do on
the qemu side (and I'll have to apologize for spamming a bunch of lists :/).
I think "crash" already iterates over all of the notes in the note
buffer, but skips everything different from NT_PRSTATUS.
(3) Regarding the structure of the notes, we have to consider the
placement of the notes and their internal structure. The placement is
different between the ELF and the KDUMP file format. The internal
structure of the notes is identical between the two file formats.
For example, for a 4 VCPU guest, you end up with note names like
CORE
CORE
CORE
CORE
QEMU
QEMU
QEMU
QEMU
All of these are Elf64_Nhdr structures. The CORE ones have type
NT_PRSTATUS, and the QEMU ones have type 0.
(3a) The placement in the ELF file is already handled by "crash". Each
note "simply" gets its own ELF note segment/section.
(3b) In the kdump file, the Elf64_Nhdr structures (8 pieces in total, in
the above example -- 4x CORE, 4x QEMU) are concatenated in that order,
and finally stored at "offset_note".
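Walking such a concatenated note buffer is mechanical: each entry is a 12-byte header, then the name, then the descriptor, with name and descriptor each padded to 4 bytes. A minimal sketch, assuming the buffer is already in memory and the host has the same byte order as the dump. struct note_hdr mirrors Elf64_Nhdr, and find_note() is a hypothetical helper name. As a plausibility check, assuming the usual 336-byte x86_64 prstatus, a CORE note takes 12 + 8 + 336 = 356 bytes and a QEMU note 12 + 8 + 432 = 452 bytes, so four of each give 3232 bytes, which is the size_note value in the header dump.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as Elf64_Nhdr in <elf.h>, repeated to stay self-contained. */
struct note_hdr {
    uint32_t n_namesz;  /* name length, including the trailing NUL */
    uint32_t n_descsz;  /* descriptor (payload) length */
    uint32_t n_type;
};

/* Round up to the 4-byte padding used for note names and descriptors. */
static size_t align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Find the index-th note (0-based) called name in buf[0..len).
 * On success return a pointer to its descriptor and store the descriptor
 * size in *descsz; return NULL if there is no such note. */
const unsigned char *find_note(const unsigned char *buf, size_t len,
                               const char *name, unsigned index,
                               size_t *descsz)
{
    size_t off = 0;
    size_t namelen;

    assert(buf != NULL && name != NULL && descsz != NULL);
    namelen = strlen(name) + 1;
    while (off <= len && len - off >= sizeof(struct note_hdr)) {
        struct note_hdr h;
        size_t name_off, desc_off;

        memcpy(&h, buf + off, sizeof(h));
        name_off = off + sizeof(h);
        desc_off = name_off + align4(h.n_namesz);
        if (desc_off > len || h.n_descsz > len - desc_off)
            break;                      /* truncated or corrupt buffer */
        if (h.n_namesz == namelen &&
            memcmp(buf + name_off, name, namelen) == 0 &&
            index-- == 0) {
            *descsz = h.n_descsz;
            return buf + desc_off;
        }
        off = desc_off + align4(h.n_descsz);
    }
    return NULL;
}
```

Looking up "QEMU" with index 0 to 3 on the buffer at offset_note would then return the per-VCPU QEMU descriptors, while "CORE" returns the NT_PRSTATUS payloads.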
(3c) Regarding the internal structure of the notes. The CORE ones are
already known and handled. The QEMU notes have the following structure:
> Elf64_Nhdr:
> n_namesz: 5 ("QEMU")
> n_descsz: 432
> n_type: 0 (?)
> 000001b000000001 0000000000000000
  |------||------| |--------------|
  size    version  rax
> 0000000000000000 0000000000000000
  |--------------| |--------------|
  rbx              rcx
> 0000000000000000 0000000000000001
  |--------------| |--------------|
  rdx              rsi
> ffffffff81dd5228 ffffffff81a01ec8
  |--------------| |--------------|
  rdi              rsp
> ffffffff81a01ec8 0000000000000000
  |--------------| |--------------|
  rbp              r8
> 0000000000000000 00000013911d5f29
  |--------------| |--------------|
  r9               r10
> 0000000000000000 ffffffff81c00480
  |--------------| |--------------|
  r11              r12
> 0000000000000000 ffffffffffffffff
  |--------------| |--------------|
  r13              r14
> 000000000309f000 ffffffff810375ab
  |--------------| |--------------|
  r15              rip
> 0000000000000246 ffffffff00000010
  |--------------| |------||------|
  rflags           cs/lim  cs/sel
> 0000000000a09b00 0000000000000000
  |------||------| |--------------|
  cs/pad  cs/flags cs/base
> ffffffff00000018 0000000000c09300
  |------||------| |------||------|
  ds/lim  ds/sel   ds/pad  ds/flags
> 0000000000000000 ffffffff00000018
  |--------------| |------||------|
  ds/base          es/lim  es/sel
> 0000000000c09300 0000000000000000
  |------||------| |--------------|
  es/pad  es/flags es/base
> ffffffff00000000 0000000000000000
  |------||------| |------||------|
  fs/lim  fs/sel   fs/pad  fs/flags
> 0000000000000000 ffffffff00000000
  |--------------| |------||------|
  fs/base          gs/lim  gs/sel
> 0000000000000000 ffff880003200000
  |------||------| |--------------|
  gs/pad  gs/flags gs/base
> ffffffff00000018 0000000000c09300
  |------||------| |------||------|
  ss/lim  ss/sel   ss/pad  ss/flags
> 0000000000000000 ffffffff00000000
  |--------------| |------||------|
  ss/base          ldt...
> 0000000000000000 0000000000000000
  |------||------| |--------------|
                   ...ldt
> 0000208700000040 0000000000008b00
  |------||------| |------||------|
  tr...
> ffff880003213b40 0000007f00000000
  |--------------| |------||------|
  ...tr            gdt...
> 0000000000000000 ffff880003204000
  |------||------| |--------------|
                   ...gdt
> 00000fff00000000 0000000000000000
  |------||------| |------||------|
  idt...
> ffffffff81dd2000 000000008005003b
  |--------------| |--------------|
  ...idt           cr0
> 0000000000000000 0000000001b2e000
  |--------------| |--------------|
  cr1              cr2
> 0000000007b18000 00000000000006f0
  |--------------| |--------------|
  cr3              cr4
From "target-i386/arch_dump.c":
> struct QEMUCPUSegment {
>     uint32_t selector;
>     uint32_t limit;
>     uint32_t flags;
>     uint32_t pad;
>     uint64_t base;
> };
>
> typedef struct QEMUCPUSegment QEMUCPUSegment;
>
> struct QEMUCPUState {
>     uint32_t version;
>     uint32_t size;
>     uint64_t rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp;
>     uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
>     uint64_t rip, rflags;
>     QEMUCPUSegment cs, ds, es, fs, gs, ss;
>     QEMUCPUSegment ldt, tr, gdt, idt;
>     uint64_t cr[5];
> };
>
> typedef struct QEMUCPUState QEMUCPUState;
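As a cross-check of the layout, the structure above adds up to exactly 432 bytes (0x1b0), which matches both n_descsz and the "size" field in the dump. A consumer could use that for a basic sanity check before trusting the registers. The sketch below repeats the two structs so that it compiles on its own; qemu_note_decode is a made-up helper name, and the code assumes that the analysis host has the same (little-endian) byte order as the dump.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct QEMUCPUSegment {
    uint32_t selector;
    uint32_t limit;
    uint32_t flags;
    uint32_t pad;
    uint64_t base;
};

struct QEMUCPUState {
    uint32_t version;
    uint32_t size;
    uint64_t rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp;
    uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
    uint64_t rip, rflags;
    struct QEMUCPUSegment cs, ds, es, fs, gs, ss;
    struct QEMUCPUSegment ldt, tr, gdt, idt;
    uint64_t cr[5];
};

/* 8 + 18 * 8 + 10 * 24 + 5 * 8 = 432, the n_descsz seen in the dump. */
_Static_assert(sizeof(struct QEMUCPUState) == 432,
               "unexpected QEMUCPUState layout");

/*
 * Hypothetical helper: copy a QEMU note descriptor into *out.
 * Returns 0 on success, -1 if the descriptor is too short or its
 * embedded size field does not match the structure.
 * Assumption: host byte order equals the byte order of the dump.
 */
int qemu_note_decode(const void *desc, size_t descsz,
                     struct QEMUCPUState *out)
{
    if (descsz < sizeof(*out))
        return -1;
    memcpy(out, desc, sizeof(*out));
    if (out->size != sizeof(*out))
        return -1;
    return 0;
}
```

On success, out->rip, out->rsp and out->cr[3] hold the values annotated in the dump above.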
Summary: I think the info is all there.
Thanks
Laszlo
10 years, 1 month
uniquely identifying KDUMP files that originate from QEMU
by Laszlo Ersek
(Note: I'm not subscribed to either qemu-devel or the kexec list; please
keep me CC'd.)
QEMU is able to dump the guest's memory in KDUMP format (kdump-zlib,
kdump-lzo, kdump-snappy) with the "dump-guest-memory" QMP command.
The resultant vmcore is usually analyzed with the "crash" utility.
The original tool producing such files is kdump. Unlike the procedure
performed by QEMU, kdump runs from *within* the guest (under a kexec'd
kdump kernel), and has more information about the original guest kernel
state (which is being dumped) than QEMU. To QEMU, the guest kernel state
is opaque.
For this reason, the kdump preparation logic in QEMU hardcodes a number
of fields in the kdump header. The direct issue is the "phys_base"
field. Refer to dump.c, functions create_header32(), create_header64(),
and "include/sysemu/dump.h", macro PHYS_BASE (with the replacement text
"0").
http://git.qemu.org/?p=qemu.git;a=blob;f=dump.c;h=9c7dad8f865af3b778589dd...
http://git.qemu.org/?p=qemu.git;a=blob;f=include/sysemu/dump.h;h=7e4ec5c7...
This works in most cases, because the guest Linux kernel indeed tends to
be loaded at guest-phys address 0. However, when the guest Linux kernel
is booted on top of OVMF (which has a somewhat unusual UEFI memory map),
then the guest Linux kernel is loaded at 16MB, thereby getting out of
sync with the phys_base=0 setting visible in the KDUMP header.
This trips up the "crash" utility.
Dave worked around the issue in "crash" for ELF format dumps -- "crash"
can identify QEMU as the originator of the vmcore by finding the QEMU
notes in the ELF vmcore. If those are present, then "crash" employs a
heuristic, probing for a phys_base up to 32MB, in 1MB steps.
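The shape of that heuristic can be sketched as follows. This is not the code from "crash": how a candidate phys_base is validated is not described here, so the check is passed in as a callback. The names probe_phys_base, phys_base_check_fn and check_equals are invented for this example, and treating the 32MB bound as inclusive is an assumption.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Hypothetical callback: nonzero if the dump "makes sense" at phys_base. */
typedef int (*phys_base_check_fn)(uint64_t phys_base, void *ctx);

/*
 * Try phys_base = 0, 1MB, 2MB, ..., 32MB and stop at the first
 * candidate the callback accepts. Returns 0 and stores the value in
 * *found on success, -1 if no candidate was accepted.
 */
int probe_phys_base(phys_base_check_fn check, void *ctx, uint64_t *found)
{
    for (uint64_t pb = 0; pb <= 32 * MB; pb += MB) {
        if (check(pb, ctx)) {
            *found = pb;
            return 0;
        }
    }
    return -1;
}

/* Stand-in check for demonstration only: accepts one known value. */
int check_equals(uint64_t phys_base, void *ctx)
{
    return phys_base == *(const uint64_t *)ctx;
}
```

With a real validation callback, the OVMF case described above would be expected to succeed at the 16MB candidate.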
Alas, the QEMU notes are not present in the KDUMP-format vmcores that
QEMU produces (they cannot be), hence crash has no way to tell apart
such files from those generated by genuine kdump. As an end result,
"crash" cannot automatically find the phys_base of OVMF-based Linux vmcores.
Dave suggested that a new flag, or a special phys_base value (like ~0UL)
be introduced as a distinguishing mark for QEMU-produced kdumps.
Implementing this in QEMU wouldn't be hard. The big question is
compatibility -- whose analysis tools would be broken by a (phys_base ==
~0UL) setting, or by a new flag?
Note that this change would affect SeaBIOS-based vmcores too. QEMU can't
(and shouldn't) discriminate the vmcores it dumps based on guest
firmware. (If QEMU did that, then it might as well try to figure out the
real phys_base value, which is clearly out of scope for qemu. One of the
selling points of the paging=false dump is that it doesn't involve
parsing guest RAM.)
Thanks
Laszlo
10 years, 1 month
[ANNOUNCE] crash gcore command, version 1.3.1 is released
by HATAYAMA Daisuke
This is the release of crash gcore command, version 1.3.1.
This release only fixes a build failure on x86 that I overlooked when
releasing version 1.3.0.
ChangeLog:
[bugfixes]
- Fix a build failure on x86 caused by a static reference to the type
struct user_i387_struct, which exists on x86_64 only. This reference
was introduced in v1.3.0 by the fix for the segfault caused by a
buffer overwrite of NT_FPREGSET. The correct type on x86 is struct
user_i387_ia32_struct, which is now used.
(d.hatayama(a)jp.fujitsu.com)
MD5 CheckSum:
$ md5sum ./crash-gcore-command-1.3.1.tar.gz
b89be347111c0d26f3c0882e7ad09953 ./crash-gcore-command-1.3.1.tar.gz
--
Thanks.
HATAYAMA, Daisuke
10 years, 2 months
[ANNOUNCE] crash gcore command, version 1.3.0 is released
by HATAYAMA Daisuke
This is the release of crash gcore command, version 1.3.0.
This release adds ARM64 and PPC64 support. Thanks to the respective
maintainers for developing the patch sets and verifying them at
each rc release.
The remaining changes are all bugfixes.
# The ChangeLog includes the changes that appeared in each rc release.
ChangeLog:
[new features]
- Add ARM64 support. In addition to a native ARM64 build, an x86_64
executable of the crash gcore command for ARM64 crash dumps can be
built with make target=ARM64, just like the crash utility.
(anderson(a)redhat.com)
- Add ARM64 compat mode support. This allows gcore to create
corefiles for tasks running in 32-bit compatible mode on ARM64.
(weishu(a)marvell.com)
- Add PPC64 support. This includes both big-endian and little-endian
formats.
(mtoman(a)redhat.com, anderson(a)redhat.com)
[bugfixes]
- Correct the read buffer size for NT_FPREGSET to sizeof(struct
user_i387_struct). Until now we had wrongly used sizeof(union
thread_xstate) as the read buffer size, which only happened to be
equal to sizeof(struct user_i387_struct). However, the following patch
extended union thread_xstate, and sizeof(union thread_xstate) became
larger than sizeof(struct user_i387_struct):
commit e7d820a5e549b3eb6c3f9467507566565646a669
Author: Qiaowei Ren <qiaowei.ren(a)intel.com>
Date: Thu Dec 5 17:15:34 2013 +0800
x86, xsave: Support eager-only xsave features, add MPX support
Some features, like Intel MPX, work only if the kernel uses eagerfpu
model. So we should force eagerfpu on unless the user has explicitly
disabled it.
Add definitions for Intel MPX and add it to the supported list.
[ hpa: renamed XSTATE_FLEXIBLE to XSTATE_LAZY and added comments ]
Signed-off-by: Qiaowei Ren <qiaowei.ren(a)intel.com>
Link: http://lkml.kernel.org/r/9E0BE1322F2F2246BD820DA9FC397ADE014A6115@SHSMSX1...
Signed-off-by: H. Peter Anvin <hpa(a)linux.intel.com>
Without this patch, for vmcores whose kernel versions are v3.14 or
later, gcore results in a segmentation fault due to a buffer overwrite
of NT_FPREGSET.
(d.hatayama(a)jp.fujitsu.com)
- Although ELF_DATA is defined in gcore_defs.h, ELFDATA2LSB is used
directly at elf{64,32}_fill_elf_header(). There's so far been no
problem since the existing supported architectures are all
little-endian systems. Fix this to support PPC64, which can also
use big-endian format.
(anderson(a)redhat.com)
- Fix a bug where the registers in the NT_PRSTATUS note information are
broken. The bug had existed since v1.2.2, when O(1) note information
collection was added. Without this fix, we can never get reliable
register values for failure analysis.
(weishu(a)marvell.com)
- Fix a bug where NT_386_IOPERM note information is not collected. So
far, ioperm_get() had always returned 1. As a result, NT_386_IOPERM
note information had never been included in a generated core
file even if it was available for a given task in a given crash
dump.
(d.hatayama(a)jp.fujitsu.com)
- Add new member offset initialization for struct
nsproxy::pid_ns_for_children. In upstream, the following patch
renamed struct nsproxy::pid_ns into struct
nsproxy::pid_ns_for_children.
$ git log -1 c2b1df2e
commit c2b1df2eb42978073ec27c99cc199d20ae48b849
Author: Andy Lutomirski <luto(a)amacapital.net>
Date: Thu Aug 22 11:39:16 2013 -0700
Rename nsproxy.pid_ns to nsproxy.pid_ns_for_children
nsproxy.pid_ns is *not* the task's pid namespace. The name
should clarify that.
This makes it more obvious that setns on a pid namespace is weird --
it won't change the pid namespace shown in procfs.
Signed-off-by: Andy Lutomirski <luto(a)amacapital.net>
Reviewed-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Without this fix, gcore exited abnormally during its initialization,
so the core file was never generated.
(d.hatayama(a)jp.fujitsu.com)
- Fix a wrong check of the return value of fopen(). fopen() returns
NULL in case of error, but gcore had treated it as returning a
negative integer. As a result, gcore continued execution after the
check even in case of error and then exited abnormally at the first
call of fwrite() with the invalid file pointer.
From the user's viewpoint, this bug shows up when trying to overwrite
an existing corefile that has more privileged permissions, which
results in an EPERM failure.
(d.hatayama(a)jp.fujitsu.com)
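For reference, the fopen() pitfall from the last item looks roughly like this. It is only an illustration of the pattern, not the actual gcore code; write_corefile is a made-up helper name.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write len bytes to path.
 * Returns 0 on success and -1 on any failure.
 */
int write_corefile(const char *path, const void *data, size_t len)
{
    FILE *fp = fopen(path, "wb");

    /*
     * fopen() reports failure by returning NULL. Comparing the result
     * against a negative integer never detects the error.
     */
    if (fp == NULL) {
        fprintf(stderr, "%s: %s\n", path, strerror(errno));
        return -1;
    }

    int rc = (fwrite(data, 1, len, fp) == len) ? 0 : -1;

    /* A failed fclose() can also mean lost data, so report it too. */
    if (fclose(fp) != 0)
        rc = -1;
    return rc;
}
```

With the NULL check in place, a failing open is reported right away and fwrite() is never called with an invalid file pointer.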
MD5 CheckSum:
$ md5sum ./crash-gcore-command-1.3.0.tar.gz
d530b7211793f1541a0da5968a305f4d ./crash-gcore-command-1.3.0.tar.gz
--
Thanks.
HATAYAMA, Daisuke
10 years, 2 months