Sunday, November 16, 2008

Strange backtrace

Some time ago I had to debug a strange crash. It was in a multithreaded program and manifested itself only on FreeBSD i386. The code (with all the needed declarations included, seemingly irrelevant details removed, and everything renamed) looks like this:


#include <cstdio>
#include <cstring>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#include <exception>


class system_error : public std::exception
{
public:
    // Captures the message for the current errno at construction time.
    system_error() throw() : error_text(strerror(errno)) {}
    virtual const char* what() const throw() { return error_text; }
private:
    const char* error_text;
};

class strange_thing
{
public:
    strange_thing(); // fills in some useful defaults
private:
    // lots of implementation details
};

class strange_container
{
public:
    strange_container();
    ~strange_container() { if (fd != -1) close(fd); }
    void play_with_strange_thing(const char* filename);
private:
    int fd;
};

strange_container::strange_container()
    : fd(-1)
{
    play_with_strange_thing("test.file");
}

void strange_container::play_with_strange_thing(const char* filename)
{
    fd = open(filename, O_CREAT | O_TRUNC | O_RDWR, 0777);
    if (fd == -1)
        throw system_error();
    strange_thing ss;
    /* here goes some code that uses ss and fd */
}

// well, actually it is not the main function, but something buried in a thread
int main(int argc, char* argv[])
{
    strange_container c;
    return 0;
}



The test file is not created, and the segfault looks like this:

[aep@bsd1 ~/crashtest]$ gdb ./a.out
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
(gdb) run
Starting program: /usr/home/aep/crashtest/a.out
[New LWP 100043]
[New Thread 0x28301100 (LWP 100043)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x28301100 (LWP 100043)]
strange_container::play_with_strange_thing (this=0x28306098,
filename=0x8048c33 "test.file") at crashtest.cpp:57
57 fd = open(filename, O_CREAT | O_TRUNC | O_RDWR, 0666);


At this point, everything looks valid, including the "this" pointer. By all applicable logic, the program just cannot segfault by calling open() with valid parameters. So I started adding debugging printf() statements. The statement just before the call to play_with_strange_thing() worked fine, but none of the statements inside play_with_strange_thing() printed anything. Moreover, when I added a printf() as the very first line of strange_container::play_with_strange_thing() and ran gdb on the result, it reported the crash at that printf() line!

I didn't believe my eyes. I thought (wrongly) that printf() and its buffering somehow interacted with the segfault, so I invented a different mechanism to find out whether a certain line of code was reached by the program. Namely, I replaced all my debugging printf() calls in strange_container::play_with_strange_thing() with thrown exceptions, intending to remove them one by one:


void strange_container::play_with_strange_thing(const char* filename)
{
    throw system_error(); // (1)
    fd = open(filename, O_CREAT | O_TRUNC | O_RDWR, 0777);
    throw system_error();
    if (fd == -1)
        throw system_error();
    throw system_error(); // (2)
    strange_thing ss;
    throw system_error(); // (3)
    /* here goes some code that uses ss and fd, err... throws system_error() */
}


This worked. I knew for sure that point (2) was reached, and point (3) wasn't. So something goes wrong when the strange thing is created, yet gdb never points at its constructor.

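So, I had to take another look at the implementation of the strange_thing class. The issue was actually the huge size of the object (several megabytes)! No wonder it overflowed the thread stack. Presumably this also explains the misleading backtrace: a compiler typically reserves stack space for all of a function's local variables on entry, so the first stack access in the function (such as passing the arguments to open() or printf()) already touches memory beyond the end of the stack, long before the constructor runs. You can reproduce the crash on your own system by replacing "lots of implementation details" with "char c[2000000];", implementing the strange_thing default constructor, and running with a low enough "ulimit -s" setting.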
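To make the size problem concrete, here is a minimal sketch that compares the size of an object with the process stack limit. The names soft_stack_limit() and too_big_for_stack() and the one-quarter threshold are made up for illustration, and strange_thing is just the 2 MB stand-in suggested above. Note that getrlimit(RLIMIT_STACK) reports the limit that "ulimit -s" controls; stacks of threads other than the main one are sized by the thread attributes, so their limit may be different (and smaller).

```cpp
#include <sys/resource.h>
#include <cstddef>

// Hypothetical stand-in for the post's class, using the suggested 2 MB member.
struct strange_thing {
    char c[2000000];
};

// Soft stack limit of the process in bytes (what "ulimit -s" reports in
// kilobytes), or RLIM_INFINITY if it is unlimited or cannot be queried.
rlim_t soft_stack_limit()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return RLIM_INFINITY;
    return lim.rlim_cur;
}

// Heuristic: treat an object as too big for a local variable if it would
// take more than a quarter of the stack. The 1/4 threshold is arbitrary.
bool too_big_for_stack(std::size_t object_size, rlim_t stack_limit)
{
    if (stack_limit == RLIM_INFINITY)
        return false;
    return object_size > stack_limit / 4;
}
```

Calling too_big_for_stack(sizeof(strange_thing), soft_stack_limit()) at startup (or in a test) would flag such an object before it ever lands in a local variable.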

Once the cause of the crash was known, it was easy to fix properly, by not creating huge objects on the stack.
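One way to keep such an object off the stack is to allocate it on the heap and hold it through a smart pointer. The sketch below is only an illustration of that idea, not the actual fix from the original program: strange_thing is again the 2 MB stand-in, play_with_heap_allocated_thing() is a made-up name, and std::unique_ptr assumes a C++11 compiler (older code could use std::auto_ptr or a std::vector<char> member instead).

```cpp
#include <memory>

// Hypothetical stand-in for the post's class: 2 MB of zero-initialized data.
struct strange_thing {
    strange_thing() : c() {}
    char c[2000000];
};

// Creates the big object on the heap; only the pointer itself lives on the
// stack, so the function's stack frame stays small.
bool play_with_heap_allocated_thing()
{
    std::unique_ptr<strange_thing> ss(new strange_thing());

    // Touch both ends of the buffer to show the whole object is usable.
    ss->c[0] = 'a';
    ss->c[sizeof(ss->c) - 1] = 'z';
    return ss->c[0] == 'a' && ss->c[sizeof(ss->c) - 1] == 'z';
    // ss is destroyed here and the memory is released automatically.
}
```

If the allocation fails, new throws std::bad_alloc, which is a much easier failure to diagnose than a segfault at an unrelated line.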
