----- "Jeffrey Hagen" <Jeffrey.Hagen(a)teradata.com> wrote:
Hi Petr and Dave,
I have a couple of comments on Petr's email regarding CPU count.
When the dump is the result of an NMI (NMI switch pressed) due to a hung
system, one often needs to analyze the state and backtrace for all the
CPUs. Since the kernel halts all but CPU0, the crash utility cannot
see the other "offline" CPUs.
I've never seen that behavior before. Probably because I've never seen
an x86_64 dumpfile that was created as a result of the NMI switch being
pressed? Anyway, are you saying that the NMI switch shutdown handler
takes the other cpus offline?
This behavior has changed for the x86 architecture somewhere between
2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the x8664_pda
structure.
The function x86_64_init (in x86_64.c) now calls x86_64_per_cpu_init,
which doesn't count the offline CPUs when calculating the number of
CPUs. Previously, x86_64_cpu_pda_init (called if x8664_pda exists)
didn't check for online/offline status.
Again -- I've never seen this behaviour before.
In any case, I'll look at any patch suggestions you guys have in mind.
Thanks,
Dave
Regarding #3 in Petr's email, it appears that the set command won't
accept a value >= kt_cpus (number of CPUs). It doesn't check if the CPU
is offline or not.
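
To make the two possible policies for 'set -c' concrete, here is a minimal
sketch. It is not crash's actual code: check_set_cpu(), the result enum, the
online_mask bitmask (bit N set when CPU N is online, at most 64 CPUs) and the
require_online flag are all hypothetical names for illustration. Only the
"value >= kt_cpus" range check comes from the description above.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum set_cpu_result {
    SET_CPU_OK = 0,
    SET_CPU_OUT_OF_RANGE,
    SET_CPU_OFFLINE
};

/*
 * Validate a CPU number requested via 'set -c'.
 *   kt_cpus        - CPU count as computed by the tool
 *   online_mask    - assumed bitmask, bit N set when CPU N is online
 *   require_online - 0: range check only, 1: also reject offline CPUs
 */
enum set_cpu_result check_set_cpu(int cpu, int kt_cpus,
                                  uint64_t online_mask, int require_online)
{
    /* Range check; the 64 limit exists only because of the bitmask. */
    if (cpu < 0 || cpu >= kt_cpus || cpu >= 64)
        return SET_CPU_OUT_OF_RANGE;

    /* Optional stricter policy: refuse CPUs that are not online. */
    if (require_online && !((online_mask >> cpu) & 1))
        return SET_CPU_OFFLINE;

    return SET_CPU_OK;
}
```

With the range-only policy, an offline CPU is rejected only when the CPU
count itself excludes it: if kt_cpus ends up as 1 because only CPU0 was
counted, check_set_cpu(1, 1, 0x1, 0) returns SET_CPU_OUT_OF_RANGE, while
check_set_cpu(1, 4, 0x5, 0) returns SET_CPU_OK.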
Thanks,
Jeff Hagen
>
> Hi all,
>
> before making a larger cleanup, I want to ask here for your opinion. It
> seems that there is quite a bit of confusion about the meaning of CPU
> count printed out by the crash utility.
>
> 1. Number of CPUs
>
> Some people think that crash should always output the number of CPUs in
> the system (i.e. a quad-core server should always output 'CPUS: 4'),
> while other people think that only online CPUs should be counted.
>
> 2. CPU numbering
>
> For example, if there are 4 CPUs in the system, but some of them are
> taken offline (e.g. CPU 1 and CPU 3), _and_ crash outputs the number of
> online CPUs, it would print out 'CPUS: 2'. It's not easy to find out
> that valid CPU numbers are 0 and 2 in this case.
Hi Petr,
For all but ppc64, the number shown by the initial banner and the
"sys" command is essentially "the-highest-cpu-number-plus-one".
For ppc64 (as requested and implemented by the IBM/ppc64 maintainers),
it shows the number of online cpus. There are reasons for doing it
either of the two ways, but I'm on vacation now, and you can research
the list archives for the various arguments for and against doing it
either way. Check the changelog.html for when it was changed for
ppc64, and then cross-reference the revision date with the list
archives.
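
As a rough illustration of how the two conventions differ, the sketch below
computes both from a bitmask of the CPUs the tool can see. This is not
crash's code: the function names and the cpu_mask representation (bit N set
for CPU N, at most 64 CPUs) are assumptions made for the example.

```c
#include <stdint.h>

/* Convention used for ppc64 above: count only the CPUs set in the mask. */
int count_online_cpus(uint64_t cpu_mask, int nr_cpus)
{
    int cpu, count = 0;

    if (nr_cpus > 64)       /* the mask in this sketch holds 64 CPUs */
        nr_cpus = 64;
    for (cpu = 0; cpu < nr_cpus; cpu++)
        if ((cpu_mask >> cpu) & 1)
            count++;
    return count;
}

/* Other convention: highest CPU number in the mask, plus one. */
int highest_cpu_plus_one(uint64_t cpu_mask, int nr_cpus)
{
    int cpu;

    if (nr_cpus > 64)
        nr_cpus = 64;
    for (cpu = nr_cpus - 1; cpu >= 0; cpu--)
        if ((cpu_mask >> cpu) & 1)
            return cpu + 1;
    return 0;               /* no CPU set in the mask */
}
```

For the example quoted earlier (4 CPUs, with CPU 1 and CPU 3 offline), the
mask is 0x5: count_online_cpus(0x5, 4) gives 2 and
highest_cpu_plus_one(0x5, 4) gives 3. Neither number alone says that the
valid CPU numbers are 0 and 2.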
> 3. Examining offline CPU
>
> Sometimes, it may be useful to examine the state of an offline CPU. Now,
> I know that the saved state is most likely stale, but it can be useful
> in some cases (e.g. a crash after dropping to kdb). The crash utility
> currently refuses to select an offline CPU with 'set -c #'. Are there
> any concerns about allowing it?
I tend to agree with you, but the only thing that's useful and
available from an offline cpu is the swapper task for that cpu
and the runqueue for that cpu. And both of those entities are
readily accessible if you really need them. Although I don't know
anything about kdb status, so maybe there's something of per-cpu
interest, but I don't know why it would be necessary to "set"
that cpu?
In any case, like I said before, I'm just temporarily online while
on vacation, and will be back to work on the 9th.
Thanks,
Dave
--
Crash-utility mailing list
Crash-utility(a)redhat.com
https://www.redhat.com/mailman/listinfo/crash-utility