Rajesh wrote:
Dave,
Thanks for your explanation.
The reason behind my questions is that we have an application running
at a customer site which consumes around 60GB of system memory. When
this process receives SIGSEGV or SIGABRT, the kernel starts writing the
process core dump, and here is the problem: the kernel takes at least
an hour to finish the dump. During this time the system is unresponsive
(effectively hung); I suspect it is thrashing because of the huge
amount of memory the process has mapped. This long downtime is not
acceptable to the customer, so I started looking for a better way to
tackle the problem.
1> The first thing we considered was changing the system page size from
4KB to 8KB, but this cannot be done on x86_64: the architecture has a
fixed 4KB base page size and offers no multiple-page-size option for
ordinary mappings.
2> We wrote a handler using the libbfd APIs and built it into our
application. Whenever the process receives SIGSEGV or SIGABRT, it logs
the stack trace of every thread in the process. This is not as
effective or flexible as a full core dump. (A minimal sketch of this
kind of handler appears after this list.)
3> Next we thought of using kcore/vmcore (e.g. with the crash utility)
to analyze the cause of the SIGSEGV or SIGABRT.
4> One more thought: making the elf_core_dump() function SMP-aware so
that the dump can be written in parallel. This function is responsible
for dumping the core and lives in "/usr/src/linux/fs/binfmt_elf.c".
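
For reference on item 2, here is a minimal sketch of such a crash
handler. It uses glibc's backtrace()/backtrace_symbols_fd() rather than
the libbfd APIs, and it only covers the thread that took the signal;
the multi-threaded, libbfd-based version is more involved:

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void crash_handler(int sig)
{
        void *frames[64];
        int depth = backtrace(frames, 64);

        /* backtrace_symbols_fd() writes straight to the fd and does
         * not call malloc(), so it is safe in a signal handler. */
        backtrace_symbols_fd(frames, depth, STDERR_FILENO);

        /* Restore the default action and re-raise the signal, so the
         * kernel still produces a core (or kills us) as usual. */
        signal(sig, SIG_DFL);
        raise(sig);
}

int main(void)
{
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = crash_handler;
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGABRT, &sa, NULL);

        *(volatile int *)0 = 0;  /* deliberate fault for the demo */
        return 0;
}

Link with -g -rdynamic so backtrace_symbols_fd() can resolve function
names rather than printing raw addresses.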
Any comments/ideas are welcome.
--Regards,
rajesh
Maybe tinker with maydump()?
If you know that the core dump contains VMAs that are not necessary to
dump, such as large shared memory segments, and you can identify them
from the VMA itself, you can prevent them from being copied into the
core dump. There's a patch floating around which may have been updated:
http://lkml.org/lkml/2007/2/16/149
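
For illustration, here is a sketch of the kind of check that patch
adds, paraphrased against a 2.6-era fs/binfmt_elf.c (field names and
flags vary between kernel versions, and the 64MB threshold is just an
example policy, not anything from the actual patch):

/* Decide per-VMA whether its contents go into the core file. */
static int maydump(struct vm_area_struct *vma)
{
	/* Never dump I/O-mapped devices or special mappings. */
	if (vma->vm_flags & (VM_IO | VM_RESERVED))
		return 0;

	/* Example policy: skip shared mappings (e.g. SysV shm
	 * segments) larger than 64MB. */
	if ((vma->vm_flags & VM_SHARED) &&
	    vma->vm_end - vma->vm_start > 64UL << 20)
		return 0;

	/* A private mapping that was never written to has no
	 * anon_vma attached; there is nothing useful to dump. */
	if (!vma->anon_vma)
		return 0;

	return 1;
}

Later kernels grew a user-space knob for this kind of policy,
/proc/<pid>/coredump_filter (merged in 2.6.23), which lets you exclude
e.g. shared memory from the dump without patching the kernel.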
Dave