Hi Don,
I still need some help understanding this "resurrection" of the remote
access facility, i.e., the new meanings of the REMOTE_xxx flags/macros.
The currently-existing comment for the is_remote_daemon() function
in remote.c describes the (obsolete) usage of the facility:
/*
* Parse, verify and establish a connection with the network daemon
* specified on the crash command line.
*
* The format is: [remote-hostname]:port[,remote-namelist][,remote-dumpfile]
*
* where everything but the port number is optional, and the remote-namelist
* and remote-dumpfile can be reversed.
*
* 1. The default remote host is the local host.
* 2. The default dumpfile is /dev/mem.
* 3. If no remote-namelist and remote-dumpfile are given, the daemon
* is queried for a kernel that matches the remote /proc/version.
* If no local kernel namelist is entered, the remote version will
* be copied locally when fd_init() is called.
* 4. If a remote-dumpfile is given with no remote namelist, it is presumed
* that the kernel namelist will be entered locally.
*/
Back then, the remote "crashd" daemon could be contacted for:
(1) live access to the remote machine's kernel using its /dev/mem, or
(2) accessing a vmlinux/vmcore pair that existed on the remote machine.
Can you explain this new usage? It sounds like there is a remote daemon
running on a remote Dom0 kernel, and that your new functionality can only
be used to:
(1) access a running domU guest hosted by that remote system, or
(2) access a paused domU guest hosted by that remote system.
And that there is also a device that the the remote daemon uses
to access the memory of the guest, be it paused or live.
Am I in the ballpark?
Dave
----- Original Message -----
On 11/19/13 11:08, Dave Anderson wrote:
>
> ----- Original Message -----
>> From: Don Slutz <dslutz(a)verizon.com>
>>
>> Currently crash 4.0 and later using the code:
>>
>>
http://lists.xen.org/archives/html/xen-devel/2013-11/msg02569.html
>>
>> There is some very limited documention on crashes remote ptotocol in:
>>
>>
http://lists.xen.org/archives/html/xen-devel/2013-11/msg02352.html
>>
>> and some fixes in the attachment
>> 0001-xen-crashd-Connect-crash-with-domain.patch
>> from the 1st link above.
>>
>> How ever there are issues with the current code:
>>
>> 1) The remote protocol has a minor issue (which I have not been able to
>> happen) based on the fact TCP/IP is stream based protocol. This means
>> a RECV or a SEND may not do the fully requested size of data. In fact
>> the current code assumes that the amount of data that a SEND is called
>> with will all be read with a single RECV.
>>
>> 2) The most common mismatch between crash and older kernels is
>> phys_base. In the remote case, see if the remote server supports
>> vitrual memory access, and if so, see if phys_base can be retreived.
>>
>> 3) crash assumes that the remote system is active and can not return
>> currect IP and SP to do a better back trace.
>>
>> 4) enable a non-active mode of remote access.
>>
>> This patch set attempts to fix these:
>>
>> The fix for #1 I have called NIL mode (patch 1).
>> The fix for #2 uses get_remote_phys_base (patch 3).
>> The fix for #3 uses get_remote_regs (patch 5).
>> The fix for #4 uses special file /dev/xenmem (patch 7).
>> It also REMOTE_NIL() to indicate "remote paused system".
>>
>> Don Slutz (7):
>> Add NIL mode to remote.
>> remote_proc_version: NULL terminate passed buffer on error.
>> Add get_remote_phys_base.
>> Add remote_vtop.
>> bt: get remote live registers if possible.
>> Add get_remote_cr3
>> Add support for non-live remote.
>>
>> defs.h | 8 +-
>> kernel.c | 25 +++-
>> memory.c | 25 +++-
>> remote.c | 462
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++----------
>> x86_64.c | 24 ++++
>> 5 files changed, 465 insertions(+), 79 deletions(-)
>>
>> --
>> 1.8.4
>
> Hi Don,
>
> A few comments...
>
> I don't really understand what entity the code is communicating with,
> so for the most part, I'm primarily interested in making sure that your
> patches in no way can affect the normal usage of the crash utility.
>
> That being said, is the term "NIL mode" some kind of Xen-ism?
Nope, it is term I came up with to describe the protocol change.
> It wasn't
> until I saw the display_sys_stats() function that I even understood that
> REMOTE_NIL mean "(remote paused system)". Anyway, if you made it
something
> like "REMOTE_PAUSED" or something like that, the code might make some
sense
> to the untrained eye.
Will do. Looking back at it, I ended up overloading the term. :(
>
> And it's not entirely clear to me what REMOTE_ACTIVE() means now? It used
> to imply communicating with a remote crash daeamon that accessed the live
> system that the daemon was running on. Has that meaning been changed to
> mean
> that crash is communicating with a daemon that's accessing a "live"
Xen
> guest
> kernel?
The meaning is not as clear. Since the existing versions of crash already
have REMOTE_ACTIVE() and I wanted the Xen utility to be usable to them, in
both cases the "live" Xen guest kernel is being accessed while it is paused.
I added REMOTE_PAUSED() to have a few changes to crash to correctly
understand that the guest is paused. As part of this I add the fake file
"/dev/xenmem".
>
> I'm confused about these two changes, where you re-patch your own patch:
>
> - else
> + else if (REMOTE_ACTIVE() && ((bt->task == tt->this_task) ||
> is_task_active(bt->task))) {
> + if (get_remote_regs(bt, &eip, &esp))
> + machdep->get_stack_frame(bt, &eip, &esp);
> + } else
>
> followed by:
>
> - else if (REMOTE_ACTIVE() && ((bt->task == tt->this_task) ||
> is_task_active(bt->task))) {
> + else if (REMOTE_NIL() && ((bt->task == tt->this_task) ||
> is_task_active(bt->task))) {
> if (get_remote_regs(bt, &eip, &esp))
> machdep->get_stack_frame(bt, &eip, &esp);
> } else
Clearly I need to adjust the patch order.
> If I'm not mistaken, it would seem to be impossible for "(bt->task ==
> tt->this_task)"
> to ever be true in the REMOTE_NIL() case? As I recall tt->this_task used
> to be the
> PID of the remote daemon itself, which would be accessing the memory of its
> own kernel.
> So how could the pid of the remote daemon ever be the target of a back
> trace of a
> paused guest? And so I don't understand why you first made an if clause
> for
> REMOTE_ACTIVE() to begin with -- whatever that means now -- and then
> changed it to
> REMOTE_NIL()?
This came about from some old changes from 2012 to a crash version 4.0 for
other work. I had not thought of /dev/xenmem at the time, and just copied
the condition from above. I see no reason not to drop "bt->task ==
tt->this_task". Will check that the right actions are still done.
> In any case, since the ultimate change is only interested in REMOTE_NIL(),
> can you make it all less of an eyesore by encapulating its if case
> something like:
>
> else if (SADUMP_DUMPFILE())
> get_sadump_regs(bt, &eip, &esp);
> + else if (REMOTE_NIL()) {
> + if (!((bt->task == tt->this_task) ||
> is_task_active(bt->task)) ||
> + !get_remote_regs(bt, &eip, &esp))
> + machdep->get_stack_frame(bt, &eip, &esp);
> } else
> machdep->get_stack_frame(bt, &eip, &esp);
>
> Note above, for consistency's sake with the rest of the crash utility,
> can you make get_remote_regs() return non-zero when it successfully gets
> the registers, and zero if it fails?
Sure.
> And lastly, can you fix these?
>
> $ make warn
> ...
> remote.c: In function ‘validate_phys_base’:
> remote.c:2302:25: warning: format ‘%llx’ expects argument of type
> ‘long long unsigned int’, but argument 3 has type ‘physaddr_t’
> [-Wformat]
> remote.c:2302:25: warning: format ‘%llx’ expects argument of type
> ‘long long unsigned int’, but argument 4 has type ‘physaddr_t’
> [-Wformat]
> remote.c:2302:25: warning: format ‘%llx’ expects argument of type
> ‘long long unsigned int’, but argument 5 has type ‘physaddr_t’
> [-Wformat]
> remote.c:2302:25: warning: format ‘%llx’ expects argument of type
> ‘long long unsigned int’, but argument 6 has type ‘physaddr_t’
> [-Wformat]
> remote.c: In function ‘remote_vtop’:
> remote.c:2361:2: warning: format ‘%llx’ expects argument of type
> ‘long long unsigned int’, but argument 4 has type ‘physaddr_t’
> [-Wformat]
> remote.c:2367:4: warning: format ‘%llx’ expects argument of type
> ‘long long unsigned int’, but argument 4 has type ‘physaddr_t’
> [-Wformat]
> remote.c:2352:18: warning: variable ‘p3’ set but not used
> [-Wunused-but-set-variable]
> remote.c:2352:13: warning: variable ‘p2’ set but not used
> [-Wunused-but-set-variable]
> remote.c:2352:8: warning: variable ‘p1’ set but not used
> [-Wunused-but-set-variable]
> remote.c: In function ‘get_remote_regs’:
> remote.c:2392:33: warning: unused variable ‘p6’ [-Wunused-variable]
> remote.c:2392:28: warning: variable ‘p5’ set but not used
> [-Wunused-but-set-variable]
> remote.c:2392:18: warning: variable ‘p3’ set but not used
> [-Wunused-but-set-variable]
> remote.c:2392:8: warning: variable ‘p1’ set but not used
> [-Wunused-but-set-variable]
> remote.c: In function ‘get_remote_cr3’:
> remote.c:2449:13: warning: variable ‘p2’ set but not used
> [-Wunused-but-set-variable]
> remote.c:2449:8: warning: variable ‘p1’ set but not used
> [-Wunused-but-set-variable]
> ...
Yes.
> And I'm not interested in appending these to the 4 files that you changed:
>
> +/*
> + * Specify Emacs local variables so the formating
> + * of the code stays the same.
> + *
> + * Local variables:
> + * mode: C
> + * c-set-style: "BSD"
> + * c-basic-offset: 8
> + * indent-tabs-mode: t
> + * End:
> + */
>
> Thanks,
> Dave
Ok, that is 2 votes to drop this, so I will.
-Don Slutz