Magnus Damm wrote:
Hi again Dave,

Thanks for your comprehensive answer - I really appreciate it.

On Fri, 2006-10-20 at 09:56 -0400, Dave Anderson wrote:
> Magnus Damm wrote:
> > Hi Dave,
> >
> > Thanks for your comments!
> >
> > On Wed, 2006-10-18 at 10:49 -0400, Dave Anderson wrote:
> > > Kazuo Moriwaka wrote:
> > >
> > > > Hello,
> > > >
> > > > From: Dave Anderson <anderson@redhat.com>
> > > > Subject: Re: [Crash-utility] kdump format may be updated
> > > > Date: Tue, 17 Oct 2006 09:01:32 -0400
> > > >
> > > > > As we discussed before, it would have been preferable in my
> > > > > opinion to have the starting-point mfn value for all the
> > domains,
> > > > > thereby making the dumpfile usable for all domains instead of
> > > > > just dom0.  But I will be happy with at least getting this
> > change
> > > > > in place so that crash can be used directly on the xen
> > dumpfile
> > > > > for dom0 analysis without having to run it through some other
> > > > > utility.
> > > >
> > > > Yes, I remember the discussion and I think it is possible to
> > make headers.
> > > > Now, Magnus is cleaning the patch.  He and I discussed, but I'm
> > not
> > > > enough to convince him that dom0 information is need for crash.
> > >
> > > Just to clarify this discussion.  Magnus's patch *does* include
> > the
> > > dom0 cr3 information for x86, and I am quite happy with that.
> > With that
> > > single, simple, dom0 cr3 value, the crash utility can use the
> > common
> > > xen/dom0 vmcore file unmodified.
> >
> > Yes, the patch includes that information today. But say that this
> > value
> > wouldn't be provided as a crash note - isn't it possible for
> > software to
> > locate the value anyhow?
> >
> >
> Believe me -- if you could tell me how, I would do it, and
> we could avoid this discussion...

Ok, let me try and you can shoot down my trivial attempt:

In the Xen case with the current implementation of crash notes, we save
the registers in this per-cpu variable: xen/common/kexec.c

DEFINE_PER_CPU (note_buf_t, crash_notes);

Wouldn't it be possible to look up the crash_notes symbol and find the
saved registers there?

Isn't the Xen hypervisor mapped like va = pa + offset?
 
 

I don't know.  If I look at the xen-syms file, the VirtAddr
and PhysAddr in the PT_LOAD segment are both 0xff100000,
which doesn't make sense to me:

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x001000 0xff100000 0xff100000 0x92658 0x92658 RWE 0x1000

But anyway, I would presume that the the crash_notes contents
in the xen binary contains registers that only make sense
with respect to xen, and even if you put the xen cr3 value
there, it doesn't have any relationship to the cr3 value
required for translating dom0 linux kernel virtual addresses.

In your patch here:

  [Xen-devel] [PATCH 03/04] Kexec / Kdump: x86_32 specific code

the dom0 cr3 is located by the find_dom0_cr3() function, which
has to walk the xen binary's "domain_list"

+void find_dom0_cr3(void)
+{
+       struct domain *d;
+       struct vcpu   *v;
+       uint32_t *buf;
+       uint32_t cr3;
+       Elf_Note note;
+
+       /* Don't need to grab domlist_lock as we are the only thing running */
+
+       /* No need to traverse domain_list, as dom0 is always first */
+       d = domain_list;
+       BUG_ON(d->domain_id);
+
+       for_each_vcpu ( d, v ) {
+               if ( test_bit(_VCPUF_down, &v->vcpu_flags) )
+                       continue;
+               buf = (uint32_t *)per_cpu(crash_notes, v->processor);
+               if (!buf) /* XXX: Can this ever occur? */
+                       continue;
+
+               memcpy(&note, buf, sizeof(Elf_Note));
+               buf += (sizeof(Elf_Note) +3)/4 + (note.namesz + 3)/4 +
+                       (note.descsz + 3)/4;
+
+               /* XXX: This probably doesn't take into account shadow mode,
+                * but that might not be a problem */
+               cr3 = pagetable_get_pfn(v->arch.guest_table);
+
+               buf = append_elf_note(buf, "Xen Domanin-0 CR3",
+                       NT_XEN_DOM0_CR3, &cr3, 4);
+               final_note(buf);
+
+               printk("domain:%i vcpu:%u processor:%u cr3:%08x\n",
+                      d->domain_id, v->vcpu_id, v->processor, cr3);
+       }
+}
+

So when you ask, can I "look up the crash_notes symbol and find the
saved registers there?", I'm presuming that you are talking about
the crash_notes variable above?  But if you're planning on removing
the code above, why would there be any crash_notes variable remaining?

In any case, you can see that what is needed is the per-domain cr3 value,
which is embedded in the xen-binary data.

Or, as I mentioned before, alternatively, the pfn_to_mfn_frame_list_list
value from the per-domain arch_shared_info also would work:

struct arch_shared_info {
    unsigned long max_pfn;                  /* max pfn that appears in table */
    /* Frame containing list of mfns containing list of mfns containing p2m. */
    xen_pfn_t     pfn_to_mfn_frame_list_list;
    unsigned long nmi_reason;
    uint64_t pad[32];
};

But again, there's no easy way for the crash utility to dig
them out of a completely foreign binary's.
 

 
> > Or maybe the CR3 value isn't saved at all? Shouldn't it be saved
> > instead
> > when all the other registers are saved?
> >
> > And why do we chose to save CR3 and not CR4? I know what CR3 is
> > used
> > for, but I kind of dislike the ad-hoc approach of how these things
> > are
> > added. Also, maybe more registers are needed under Xen compared to
> > the
> > Linux kernel.
> >
> > I'm especially thinking about segment registers and descriptor
> > tables.
> >
> A bit of history -- and I don't mean to "dumb-down" the discussion...
>
> If you just give me a dump of memory, I can make the crash utility
> work
> with it, as long as I can take a physical address and turn it into a
> file offset into the dumpfile.
>
> The first set of physical memory reads that crash does take the
> unity-mapped virtual addresses of key kernel data, strip off the
> identifier and pass the physical address to the read function of the
> particular memory device, i.e., be it the live memory, netdumps,
> diskdumps, LKCD dumps, mcore dumps, kdumps, xen dumps, and xen
> "xm save" files.

You are not "dumbing-down" the discussion at all, I'm actually not
familiar with the crash code at all so your history helps a lot. WIth
unity-mapped I guess you mean va = pa + offset. If not please shout!
 

That's exactly right...
 
> The issue at hand with the xen kdump vmcore is that its ELF header
> describes physical memory in machine memory terms.  To use that
> vmcore with respect to dom0 instead of the xen binary, for each
> required physical address derived from a dom0 kernel virtual address,
> the physical address is a pseudo-physical address with no correlation
> with the physical (machine) memory descriptors in the ELF header.

What you are trying to do would work well with Xen itself right, but the
non-linear virtual to pseudo-physical mapping must be difficult.

> So, given that the dom0 pseudo-physical address needs to be
> translated into a machine address, I need to be able to find my way
> to the phys_to_machine_mapping array.  From that point on, it's
> becomes a matter of searching the array for the desired
> pseudo-physical
> address, getting the associated machine address, and then using
> the PT_LOAD segments of the ELF header to find the memory.

Yes, this makes sense.

> To find the phys_to_machine_mapping array, there are two keys
> to Pandora's box:
>
> (1) the dom0 cr3 value -- which in a writable page table kernel,
>     will contain an mfn value.  With that starting point, a page
>     table walk can be initiated for the "phys_to_machine_mapping"
>     virtual address.
>
> (2) alternatively, given the dom0 pfn_to_mfn_frame_list_list mfn,
>     I also have a starting point in order to reconstruct the
>     phys_to_machine_mapping array.
>
> Either one works.  I preferred #2 because it would presumably work
> for both writable and shadow page table kernels.  But, I've never
> done any work with shadow page table kernels (Red Hat is going with
> writable...), so I don't know what the ramifications are for those
> kernels.

I would go with #2 too, although I must admit that my knowledge about he
Xen internals is a bit limited. How do you locate this one? Through a
global symbol in hypervisor space?
 


Your knowledge is a hell of lot less limited than mine...  ;-)

Anyway, yes, it appears that the "domain_list" global is the key
that leads to either the per-domain pfn_to_mfn_frame_list_list or
the per-domain cr3 value.

With respect to the per-domain pfn_to_mfn_frame_list_list location,
in the xen sources, there's the following "shared_info" structure,
that contains the data that's shared between the xen binary and
and each domain.  And this structure contains the "arch_shared_info"
structure I showed above -- which contains the per-domain
pfn_to_mfn_frame_list_list value:

/*
 * Xen/kernel shared data -- pointer provided in start_info.
 *
 * This structure is defined to be both smaller than a page, and the
 * only data on the shared page, but may vary in actual size even within
 * compatible Xen versions; guests should not rely on the size
 * of this structure remaining constant.
 */
struct shared_info {
    struct vcpu_info vcpu_info[MAX_VIRT_CPUS];

    /*
     * A domain can create "event channels" on which it can send and receive
     * asynchronous event notifications. There are three classes of event that
     * are delivered by this mechanism:
     *  1. Bi-directional inter- and intra-domain connections. Domains must
     *     arrange out-of-band to set up a connection (usually by allocating
     *     an unbound 'listener' port and avertising that via a storage service
     *     such as xenstore).
     *  2. Physical interrupts. A domain with suitable hardware-access
     *     privileges can bind an event-channel port to a physical interrupt
     *     source.
     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
     *     port to a virtual interrupt source, such as the virtual-timer
     *     device or the emergency console.
     *
     * Event channels are addressed by a "port index". Each channel is
     * associated with two bits of information:
     *  1. PENDING -- notifies the domain that there is a pending notification
     *     to be processed. This bit is cleared by the guest.
     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
     *     will cause an asynchronous upcall to be scheduled. This bit is only
     *     updated by the guest. It is read-only within Xen. If a channel
     *     becomes pending while the channel is masked then the 'edge' is lost
     *     (i.e., when the channel is unmasked, the guest must manually handle
     *     pending notifications as no upcall will be scheduled by Xen).
     *
     * To expedite scanning of pending notifications, any 0->1 pending
     * transition on an unmasked channel causes a corresponding bit in a
     * per-vcpu selector word to be set. Each bit in the selector covers a
     * 'C long' in the PENDING bitfield array.
     */
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];

    /*
     * Wallclock time: updated only by control software. Guests should base
     * their gettimeofday() syscall on this wallclock-base value.
     */
    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */

    struct arch_shared_info arch;

};

The best "common-code" example I see in the xen sources accessess the
shared_info structure above like so:

void evtchn_set_pending(struct vcpu *v, int port)
{
    struct domain *d = v->domain;
    shared_info_t *s = d->shared_info;

So looking at your currently existing find_dom0_cr3() function,
the vcpu pointer can be pulled the same way from the domain_list:

+void find_dom0_cr3(void)
+{
+       struct domain *d;
+       struct vcpu   *v;
+       uint32_t *buf;
+       uint32_t cr3;
+       Elf_Note note;
+
+       /* Don't need to grab domlist_lock as we are the only thing running */
+
+       /* No need to traverse domain_list, as dom0 is always first */
+       d = domain_list;
+       BUG_ON(d->domain_id);
+
+       for_each_vcpu ( d, v ) {
 

> In fact, my original suggestion was for an ELF note with an array
> of cr3's or pfn_to_mfn_frame_list_list mfn's, i.e., for dom0 and
> all the other domUs running on the system.  That would be an awesome
> capability -- a single vmcore that could be used against the xen
> binary using gdb, or with the crash utility for analyszing dom0
> or any of the other domU sessions.
>
> But, there apparently was an issue with the idea of having
> an ELF note with an undetermined size.

I agree that your idea sounds nice from the tool side of it, but things
get complex for us because we don't know the maximum number of domains
so we cannot allocate a static array. And we don't want to allocate
things at crash time either...

I think that Moriwaka-sans tool is able to find the pfn to mfn mapping
data without any special notes, but I may be wrong. Isn't it possible to
backtrack the internal data structures of Xen?

> Anyway, like I mentioned to Kazuo, I'd be happy with just the
> dom0 "key"...

That's what we have now. And I want to keep you happy. But I also want
the code as simple as possible, so I need to find a balance. I think the
code is pretty straight forward as-is today though. But I'm worried
about the fact that we only save CR3 and not say pointers to and sizes
of local and global descriptor tables.

> Also, in the ELF NT_PRSTATUS note, the cr3 value is not stored
> since it consists of a processor-specific user_regs_struct.

Do you mean that registers that are stored in ELF_NT_PRSTATUS should be
user registers, not system registers? That's my interpretation of it.

> So if you were to append it there, I could use that instead.
> Your initial patch seems to have put it both there and
> in the new NT_XEN_DOM0_CR3 note:
>
> +/* The cr3 for dom0 on each of its vcpus
> + * It is added as ELF_Prstatus prstatus.pr_reg[ELF_NGREG-1)], where
> + * prstatus is the data of the elf note, and ELF_NGREG was extended
> + * by one to allow extra space.
> + * This code runs after all cpus except the crashing one have
> + * been shutdown so as to avoid having to hold domlist_lock,
> + * as locking after a crash is playing with fire */

That comment was wrong. It is fixed in the patchset posted to xen-devel
today - 200610123. Sorry about the mess.

> +               buf = append_elf_note(buf, "Xen Domanin-0 CR3",
> +                       NT_XEN_DOM0_CR3, &cr3, 4);
>
> But, again, only for x86.

It should do it both for i386 and x86_64. And that's the only two
architectures with CR3, right? =)

That's correct.
 

> > > What I don't understand is whether the same thing is going to be
> > done
> > > for x86_64?
> > >
> > > NT_XEN_DOM0_CR3 is #define'd in xen/include/xen/elfcore.h in
> > > this patch:
> > >
> > >   [Xen-devel] [PATCH 02/04] Kexec / Kdump: Code shared between
> > x86_32 and x86_64
> > >
> > > NT_XEN_DOM0_CR3 is used by the find_dom0_cr3() function in
> > > xen/arch/x86/crash.c, in this patch:
> > >
> > >   [Xen-devel] [PATCH 03/04] Kexec / Kdump: x86_32 specific code
> > >
> > > But there is no analogous x86_64 usage in this patch:
> > >
> > >   [Xen-devel] [PATCH 04/04] Kexec / Kdump: x86_64 specific code
> > >
> > > Is NT_XEN_DOM0_CR3 not being used by x86_64 by mistake,
> > > or on purpose?  Or perhaps you're saying that it's going to be
> > > pulled out completely?
> >
> > Yes and maybe. It was a mistake to put it into that patch, but I
> > think
> > the patch ends up in the common code anyway. Next version will be
> > improved.
> >
> > Ripping it out? Well, I'm tempted. But nothing is decided. Feel free
> > to
> > argue. =)
> >
> > The main reason behind it is that we need to make kexec-tools
> > xen-aware.
> > And this awareness may lead to that we don't need any crash notes at
> > all
> > in the Xen case. Mostly because kexec-tools only knows about dom0,
> > but
> > the crash dump is about the entire system.
> >
> > Today we have some code that stores the elf notes in the hypervisor
> > in
> > the same format as the kernel, and to pass these notes we need to
> > hook
> > in hypercalls in the kernel so that the user-space tool can find
> > the
> > address of the crash notes.
> >
> > I think it would make sense to let dom0 save it's registers within
> > dom0
> > using the good old crash note format, but to use a simpler format
> > for
> > registers for the hypervisor. And the let the tools locate the
> > registers
> > by looking up global symbols.
> >
> > > > I know you don't want to treat xen binary file with crash, but
> > I'm
> > > > not clear why.  Please discuss with him directly to make up xen
> > kdump
> > > > file formats.  The patch will be merged into xen-3.0.4.
> > > > I hope we can find solution before merge.
> > > >
> > >
> > > The crash utility is wholly based upon the internal structure
> > > of the Linux kernel.
> >
> > So why can't you just require that Xen dumps needs to be cut out
> > with
> > dom0 cut?
> >
> >
> Well, to answer your question with a question:
>
>  Why should it be required if it could be so easily avoided?
>
> As Henry David Thoreau said, "Simplicity, simplicity, simplicity..."

I'm glad to hear that. That means that we both want simplicity. =)

I think we should save registers that are not saved today, and I would
be happy to add that data to our crash notes in the hypervisor.

But can you locate the crash notes without any reference from the ELF
header?
 

As I explained above, I don't see how?  I can't maneuver around
following multiple data structure linkages -- using data structures
that crash knows nothing about.  (Not to mention knowing how exactly
the xen virtual-to-physical translation works...)

But I'm still confused about that -- why would the "crash_notes"
exist in the xen sources/binary if you're not going to put them
in the ELF header of a resultant xen dumpfile?  What exactly
is going to be put into the dumpfile's ELF header?

Dave