Hi Dave,
Attached is our suggested patch for the issue with CPU count in
an NMI switch induced coredump. Basically the change uses the
cpu_present_mask instead of the cpu_online_mask in x86_64_per_cpu_init
and x86_64_get_smp_cpus.
I understand why you need to do it that way, but to make a change like
this makes me a little nervous because nobody's ever reported this
situation before, and I'm somewhat paranoid it may lead to unexpected
behavior. Plus there are old kernels that don't even have a cpu_present_map.
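To make the difference between the counting policies in this thread concrete, here is a minimal standalone sketch. It is not the attached patch and not code from crash: the cpumask_t typedef, the helper names and the have_present flag are all hypothetical, and a mask is modelled as a plain unsigned long in which bit N stands for CPU N. It shows the online count, the "highest cpu number plus one" count, and a fallback to the online mask for kernels that have no present mask.

```c
#include <stdio.h>

/* Hypothetical stand-in for a kernel cpumask: bit N set means CPU N is in the mask. */
typedef unsigned long cpumask_t;

/* Count the CPUs in a mask (number of set bits). */
static int mask_weight(cpumask_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)  /* clear the lowest set bit each pass */
        n++;
    return n;
}

/* "Highest cpu number plus one": position of the top set bit, 0 if the mask is empty. */
static int mask_highest_plus_one(cpumask_t mask)
{
    int n = 0;

    for (; mask; mask >>= 1)
        n++;
    return n;
}

/*
 * Assumed policy for illustration only: count from the present mask when
 * the dump provides one, otherwise fall back to the online mask.
 */
static int count_cpus(int have_present, cpumask_t present, cpumask_t online)
{
    return mask_weight(have_present ? present : online);
}
```

With present = 0xf and online = 0x1 (four CPUs, only CPU 0 still online), count_cpus(1, 0xf, 0x1) gives 4 and count_cpus(0, 0xf, 0x1) gives 1. For an online mask of 0x5 (CPUs 0 and 2), mask_weight gives 2 while mask_highest_plus_one gives 3.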
In answer to your question below: "Are you saying that the NMI
switch shutdown handler takes the other cpus offline?" --- Yes!!
Thanks,
Jeff
-----Original Message-----
From: crash-utility-bounces(a)redhat.com
[mailto:crash-utility-bounces@redhat.com] On Behalf Of Dave Anderson
Sent: Thursday, August 12, 2010 6:22 AM
To: Discussion list for crash utility usage,maintenance and development
Subject: Re: [Crash-utility] Question on online/present/possible CPUS
----- "Jeffrey Hagen" <Jeffrey.Hagen(a)teradata.com> wrote:
> Hi Petr and Dave,
>
> I have a couple of comments on Petr's email regarding CPU count.
>
> When the dump is the result of an NMI (nmi switch pressed) due to a hung
> system, one often needs to analyze the state and backtrace for all the
> CPU's. Since the kernel halts all but CPU0, the crash utility cannot
> see the other "offline" CPU's.
I've never seen that behavior before. Probably because I've never seen
an x86_64 dumpfile that was created as a result of the NMI switch being
pressed? Anyway, are you saying that the NMI switch shutdown handler
takes the other cpus offline?
> This behavior has changed for the x86 architecture somewhere between
> 2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the x8664_pda
> structure.
> The function x86_64_init (in x86_64.c) now calls x86_64_per_cpu_init
> which doesn't count the offline CPUS when calculating the number of
> CPU's. Previously, x86_64_cpu_pda_init (called if x8664_pda exists),
> didn't check for online/offline status.
Again -- I've never seen this behaviour before.
In any case, I'll look at any patch suggestions you guys have in mind.
Thanks,
Dave
> Regarding #3 in Petr's email. It appears that the set command won't
> accept a value >= kt_cpus (number of CPUS). It doesn't check if the CPU
> is offline or not.
>
> Thanks,
>
> Jeff Hagen
>
>
>
> >
> > Hi all,
> >
> > before making a larger cleanup, I want to ask here for your opinion. It
> > seems that there is quite a bit of confusion about the meaning of CPU
> > count printed out by the crash utility.
> >
> > 1. Number of CPUs
> >
> > Some people think that crash should always output the number of CPUs in
> > the system (ie. a quad-core server should always output 'CPUS: 4'),
> > while other people think that only online CPUs should be counted.
> >
> > 2. CPU numbering
> >
> > For example, if there are 4 CPUs in the system, but some of them are
> > taken offline (e.g. CPU 1 and CPU 3), _and_ crash output the number of
> > online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
> > that valid CPU numbers are 0 and 2 in this case.
>
> Hi Petr,
>
> For all but ppc64, the number shown by the initial banner and the
> "sys" command is essentially "the-highest-cpu-number-plus-one".
> For ppc64 (as requested and implemented by the IBM/ppc64 maintainers),
> it shows the number of online cpus. There's reasons for doing it
> either of the two ways, but I'm on vacation now, and you can research
> the list archives for the various arguments for-and-against doing it
> either way. Check the changelog.html for when it was changed for
> ppc64, and then cross-reference the revision date with the list
> archives.
>
> > 3. Examining offline CPU
> >
> > Sometimes, it may be useful to examine the state of an offline CPU. Now,
> > I know that the saved state is most likely stale, but it can be useful
> > in some cases (e.g. a crash after dropping to kdb). The crash utility
> > currently refuses to select an offline CPU with 'set -c #'. Are there
> > any concerns about allowing it?
>
> I tend to agree with you, but the only thing that's useful and
> available from an offline cpu is the swapper task for that cpu
> and the runqueue for that cpu. And both of those entities are
> readily accessible if you really need them. Although I don't know
> anything about kdb status, so maybe there's something of per-cpu
> interest, but I don't know why it would be necessary to "set"
> that cpu?
>
> In any case, like I said before, I'm just temporarily online while
> on vacation, and will be back to work on the 9th.
>
> Thanks,
> Dave
>
> --
> Crash-utility mailing list
> Crash-utility(a)redhat.com
> https://www.redhat.com/mailman/listinfo/crash-utility
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility