Note that most of the theory (particularly the low-level segmentation fault details) is also valid for Windows and other operating systems. The commands to configure core dumps and retrieve them are Linux-specific, though. I also assume that the program you are trying to debug is “FreeSWITCH”, but you can easily substitute the name of whichever program is misbehaving in your case and you should be fine.
What is a core dump?
Sometimes problems with FreeSWITCH, Asterisk or just about any other program in Linux are hard to debug by just looking at the logs. Sometimes the process crashes and you don’t have a chance to use the CLI to look at stats. The logs may not reveal anything particularly interesting, or just some minimal information. Sometimes you need to dig deeper into the state of the program to poke around and see what is going on inside. Linux process core dumps are meant for that.
The most typical case for a core dump is when a process dies violently and unexpectedly. For example, if a programmer does something like this:
*(int *)0 = 0;
In real code it is usually not that straightforward; it may be that the programmer did not expect a certain variable to contain a NULL (0) value. The process is killed by the Linux kernel because, by default, a Linux process does not map the memory address 0 to anything. This causes a page fault in the processor, which is trapped by the kernel; the kernel then sees that the given process does not have anything mapped at address 0 and sends the SIGSEGV (Unix signal) to the process. This is called a segmentation fault.
The default signal handler for SIGSEGV dumps the memory of the process and kills the process. This memory dump contains all the memory for this process (and all the threads belonging to that process) and that is what is used to determine what went wrong with the program that attempted to reference an invalid address (0 in this case, but any invalid address can cause it).
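You can observe this signal/exit-status relationship from the shell: a process killed by signal N exits with status 128 + N, and SIGSEGV is signal 11. A quick sketch (any POSIX shell):

```shell
# Kill a child shell with SIGSEGV (signal 11) to simulate a crash.
sh -c 'kill -SEGV $$' || status=$?
# A process killed by signal N exits with status 128 + N, so SIGSEGV gives 139.
echo "exit status: $status"
```

If core dumps are enabled (see below), the shell will also print “(core dumped)” after the “Segmentation fault” message.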
You can learn more about it here: http://en.wikipedia.org/wiki/Segmentation_fault
How can I make sure the core dump will be saved?
Each process has a limit on how big this core can be. If the limit is exceeded, no core dump is saved. By default this limit is 0, which means no core will be dumped at all!
Before starting the process you must use “ulimit”. The “ulimit” command sets various limits for the current process. If you execute it from the bash shell, the limits are applied to your bash shell process. This also means any process you start from bash will inherit those limits (because they are child processes of your bash shell).
ulimit -a
That shows you all the limits for your bash shell. In order to guarantee that a core will be dumped you must set the “core file size” limit to “unlimited”.
ulimit -c unlimited
If you are starting the process from an init script or something similar, the init script has to do it. Some programs are smart enough to raise their limits themselves, but it is always better to make sure you have an unlimited core file size for your bash shell. You may then want to add those ulimit instructions to your $HOME/.bashrc file.
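The inheritance of limits by child processes is easy to verify yourself. A small sketch (the `|| echo` guard is only there in case the hard limit forbids raising the soft one):

```shell
# Raise the core size limit for this shell; any child process inherits it.
ulimit -c unlimited 2>/dev/null || echo "hard limit forbids unlimited cores"
# A child shell reports the limit it inherited from us:
sh -c 'echo "child core limit: $(ulimit -c)"'
```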
Where is the core dump saved?
Each process has a “working directory”. See http://en.wikipedia.org/wiki/Working_directory
That is where the process core dump will be saved by default. However, some system-wide settings affect where the core is dumped.
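On Linux you can inspect the working directory of any running process through /proc, which is handy when you want to know where a core would land by default:

```shell
# /proc/<pid>/cwd is a symlink to that process's working directory;
# $$ is the PID of the current shell, used here as an example.
ls -l /proc/$$/cwd
```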
“/proc/sys/kernel/core_pattern” and “/proc/sys/kernel/core_uses_pid” are 2 files that control the base file name pattern for the core, and whether the core name will be appended with the PID (process ID).
The recommended settings are:
mkdir -p /var/core
echo "/var/core/core" > /proc/sys/kernel/core_pattern
echo 1 > /proc/sys/kernel/core_uses_pid
You can confirm what you just did with:
cat /proc/sys/kernel/core_pattern
cat /proc/sys/kernel/core_uses_pid
These settings will cause any process in the system that crashes to dump its core at:
/var/core/core.<pid>
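Note that changes written directly to /proc do not survive a reboot. If you want them to be permanent, the same settings can go in /etc/sysctl.conf under their sysctl key names (reading the current values requires no root):

```shell
# Read the current settings (no root needed):
cat /proc/sys/kernel/core_pattern
cat /proc/sys/kernel/core_uses_pid
# To persist them across reboots, add to /etc/sysctl.conf:
#   kernel.core_pattern = /var/core/core
#   kernel.core_uses_pid = 1
# and apply with: sysctl -p
```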
What if I just want a core dump without killing the process?
In some situations a process becomes unresponsive, or its response times degrade. For example, you try to execute CLI commands in Asterisk or FreeSWITCH but there is no output, or worse, your command line prompt gets stuck. The process is still there, it may even be processing calls, but some things take a long time or just don’t get done. You can use “gdb” (the GNU debugger) to dump a core of the process without killing it and with almost no disruption of the service. I say almost because for a large process, dumping a core may take a second or two; during that time the process is frozen by the kernel, so active calls may drop some audio (if you’re debugging a real-time audio system like Asterisk or FreeSWITCH).
The trick to do it fast is to first create a file with the GDB commands required to dump the core.
Recent versions of CentOS include (in the gdb RPM package) the “gcore” command, which does everything for you. You only need to execute:
gcore $(pidof freeswitch)
To dump the core of the running process.
If you are in a system that does not include gcore, you can do the following:
echo -ne "generate-core-file\ndetach\nquit" > gdb-instructions.txt
The 3 instructions added to the file are:
generate-core-file
detach
quit
This will do exactly what we want: generate the core file for the attached process, then detach from the process (to let it continue), and then quit gdb.
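If you find echo -ne hard to read, the same instruction file can be written with a here-document; the result is identical:

```shell
# Write the three GDB commands, one per line:
cat > gdb-instructions.txt <<'EOF'
generate-core-file
detach
quit
EOF
# Show what we wrote:
cat gdb-instructions.txt
```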
You then use GDB (you may need to install it with “yum install gdb”) to attach to the running process, dump the core, and get out as fast as possible.
gdb /usr/local/freeswitch/bin/freeswitch $(pidof freeswitch) -x gdb-instructions.txt
The arguments to attach to the process include the original binary that was used to start it and the PID of the running process. The -x switch tells GDB to execute the commands in the file after attaching.
The core will be named core.<pid> by default and the path is not affected by the /proc/sys/kernel settings of the system.
This core can now be used by the developer to troubleshoot the problem.
Sometimes, though, the developer will be more interested in a full back trace (stack trace), because the core dump itself cannot easily be examined on any box other than the one where it was generated. It might therefore be up to you to provide that stack trace, which you can do with:
gdb /usr/local/freeswitch/bin/freeswitch core.<pid>
(gdb) set logging file my_back_trace.txt
(gdb) set logging on
(gdb) thread apply all bt full
(gdb) quit
Then send the file my_back_trace.txt to the developer, or analyze it yourself; sometimes it is easy to spot the problem even without development experience!
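The interactive session above can also be scripted with GDB’s -batch and -ex switches, so the whole back trace is produced in one shot. This is only a sketch: the binary path and the core file name (core.12345 here) are placeholders you must replace with your own, and the guard just skips the work if gdb or the core file is missing:

```shell
#!/bin/sh
# Placeholder paths: substitute your own binary and core file names.
BINARY=/usr/local/freeswitch/bin/freeswitch
CORE=core.12345
if command -v gdb >/dev/null 2>&1 && [ -f "$CORE" ]; then
    # -batch runs the -ex commands and exits; "set logging on" makes
    # GDB actually write the output to my_back_trace.txt.
    gdb "$BINARY" "$CORE" -batch \
        -ex "set logging file my_back_trace.txt" \
        -ex "set logging on" \
        -ex "thread apply all bt full"
else
    echo "gdb or core file not found; skipping"
fi
```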
Thanks for sharing your tips Kapil!
Indeed. I am not very fond of C++, but GDB scripting is a very nice way to debug more complex data structures.
Have you heard of Project Archer? I have not had the time to install it, but I remember wanting something like that when I was working on a complex C++ application that used to crash all the time 😛
http://sourceware.org/gdb/wiki/ProjectArcher
Hey Moy,
Nice article on GDB/Core !!
Just to add, we can also take advantage of GDB scripts to debug any core very fast.
For example, see the GDB script at the link below, which can be used to inspect C++ STL containers (list/vector/string, etc.) and check whether STL corruption is the cause of your segmentation fault.
http://sourceware.org/ml/gdb/2008-02/msg00064/stl-views.gdb
In the same way, we can create GDB scripts to fit our own requirements and access all of our complex data structures, memory management structures, etc., so that while analyzing a core we can quickly check what went wrong.
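For reference, a user-defined GDB command in such a script looks like this. This is a toy example, not taken from the linked file; real commands would walk your actual data structures:

```shell
# Write a tiny GDB script defining a custom command named "hello":
cat > my-helpers.gdb <<'EOF'
define hello
  printf "hello from a user-defined command\n"
end
document hello
  Print a greeting; a stand-in for commands that walk real data structures.
end
EOF
# Load it in a session with: gdb -x my-helpers.gdb ... then run "hello".
cat my-helpers.gdb
```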