In a busy multithreaded program that opens and closes files a lot, it's surprisingly easy for a minor C mistake to cause data corruption.

For example:

// returns 0 on success
int check_file(int fd)
{
    char buf[16];
    // read the first 16 bytes of the file
    pread(fd, buf, 16, 0);
    if (valid_file_header(buf)) {
        return 0;
    } else {
        close(fd);
        return ERROR_BAD_HEADER;
    }
}

void consume_file(const char *filename)
{
    int fd = open(filename, O_RDWR);
    if (fd < 0) {
        perror("failed to open file");
        return;
    }

    int status = check_file(fd);
    if (status) {
        fprintf(stderr, "file header check failed\n");
    }
    close(fd);
}

See the bug? If valid_file_header() fails, both check_file() and consume_file() will attempt to close the same file descriptor.

This code will always work fine in a single-threaded program. The programmer's mistake has no effect, because this is perfectly safe:

open("filename") -> 3 // kernel allocated fd 3 and returned it.
close(3)              // kernel closes fd 3.
close(3)              // 3 is not a valid fd; kernel returns EBADF

Almost nobody ever checks the return value of close. We might save ourselves some headaches by wrapping the call like this:

void check_close(int fd)
{
    int rv = close(fd);
    assert(rv == 0);
}

Wrapping all calls to close() might seem a bit paranoid. But that paranoia is justified if you want the code to be thread-safe (and remain thread-safe across years of maintenance).

I want to demonstrate the data corruption, but I also want to highlight how annoying and difficult it can be to debug the problem once it occurs. Let's say the buggy code above is part of a multithreaded program which also does this:

void write_database_entry(const char* file, int row, int value)
{
    int db = open(file, O_RDWR);
    if (db < 0) {
        perror("failed to open database file");
        return;
    }
    pwrite(db, &value, sizeof(value), calc_row_offset(row));
    close(db);
}

void write_log_entry(const char* logfile, const char* str, size_t len)
{
    int log = open(logfile, O_WRONLY|O_APPEND);
    if (log < 0) {
        perror("failed to open log file");
        return;
    }
    write(log, str, len);
    close(log);
}

The details of this code are less important. What matters is that we may have different threads opening files, writing, and closing. Note that these two functions are not themselves buggy. (I am omitting error-checking on pwrite and write just to keep the example short.)

Now that we have multiple threads doing open and close, and one of those code paths calls close() twice, we've opened the door to a big problem. One day, the owner of this code is going to get a phone call in the middle of the night because a database file is corrupt.

We find that part of the database has been overwritten with a log message. Probably what happens next is somebody gets to spend the next few hours staring at write_log_entry trying to figure out what the problem is. But that function is just an innocent victim of the original double-close bug.

Here's the order of events that caused this particular corruption:

      Thread A                Thread B                Thread C
   consume_file()          write_log_entry        write_database_entry
   and check_file()
   -----------------       --------------------   --------------------
1. consume_file
   open("badfile") -> 3
2. check_file close(3)
3.                         open("logfile") -> 3
4. consume_file close(3)
5.                                                open("database") -> 3
6.                                                pwrite(3, ...)
7.                         pwrite(3, ...)
8.                         close(3)
9.                                                close(3) -> EBADF

This timing exploits a race condition: if someone (in this case write_log_entry) calls open() between the duplicate calls to close(), scary things start happening.

Thread B is the first victim: its file descriptor was closed out from under it before it had a chance to do anything. Had that thread attempted any I/O at that moment, it would have received EBADF and we might have had a clue that something had gone wrong; but as long as the thread isn't using the descriptor, the problem remains undetected.

Thread C is the ultimate victim: thread C opened fd 3 as the file "database", but thread B still believes fd 3 points to "logfile". So thread B unwittingly scribbles on the "database" file.

A double-close bug can affect any code in the same process that opens files. Or anything that shares the file descriptor space: sockets, for example. The double-close can trigger a process of "file descriptor juggling"-- one buggy thread can cause other well-behaved threads to pass file descriptors from thread to thread unintentionally. And there's no limit to how many times the fd can be "passed" from one thread to another; the example above can be extended to an arbitrary number of threads (across an unbounded amount of time).

It's interesting to note that eventually, some thread is bound to see close() return EBADF (assuming there isn't an independent fd-leak bug). If you ever see this in a multithreaded program, it's possible you have a double-close bug-- though you won't know where. The reason this bug can be so confusing is that you don't know when or where the fd juggling began.

So, what can be done to defend against this kind of problem creeping in?

In C your options are limited. Always check the return value of close(): if you ever get EBADF, your code is very dangerous and should be considered not thread-safe. You can also keep an eye out for EBADF in other places (reads and writes)-- that might also be an indication that you have a problem like this somewhere in the same process.

Better would be to stash the raw file descriptor in an object and implement a close method that atomically closes the fd and overwrites the stored value (with -1, perhaps).
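A sketch of that idea in C11 (safe_fd and its functions are hypothetical names, not a real API): an atomic exchange guarantees that only one caller ever retrieves the real fd, so the duplicate close is caught in our own code rather than reaching the kernel after the number has been recycled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical wrapper type: the fd lives behind an atomic so that
 * exactly one caller can ever retrieve-and-close it. */
struct safe_fd {
    atomic_int fd;
};

void safe_fd_init(struct safe_fd *f, int fd)
{
    atomic_init(&f->fd, fd);
}

/* Returns 0 on the first close, -1 on any later attempt. The atomic
 * exchange swaps in -1, so a double close never reaches the kernel. */
int safe_fd_close(struct safe_fd *f)
{
    int fd = atomic_exchange(&f->fd, -1);
    if (fd < 0)
        return -1;
    int rv = close(fd);
    assert(rv == 0);
    return 0;
}
```

The second call through safe_fd_close() returns -1 without issuing a system call, so a recycled descriptor belonging to another thread can never be closed by mistake.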

What if you use fopen() and fclose()? These take a FILE * parameter, so you might guess they get this right. But the man page for fclose() says this:

The  behaviour of fclose() is undefined if the stream
parameter is an illegal pointer, or is a descriptor
already passed to a previous invocation of fclose().

On Ubuntu 14.10 (glibc 2.19) it does appear to protect you against the double-close... sorta. A test program that calls fclose() twice crashes due to a double-free in _IO_new_fclose(). So this version doesn't call close() twice (though I can't say exactly why-- the glibc code is pretty unreadable). Between the documented undefined behavior and the willingness to bumble into a double-free, I don't trust this code very far.

It's easy to see how this sort of error happens, though: manually managing object lifetime is no fun. In C++ you'd probably do it the other way around: instead of attempted-close causing destruction of an object, use destruction as a signal to close the file descriptor.

Given the seriousness of this problem, it would be nice if the C library were to treat close() returning EBADF as a fatal error (and crash), just as it does with double-free or other flavors of memory-allocator-related corruption. It's too late to force every old program to start checking the return value, but it would be better to force crashes now than risk silent data corruption later.

PS. Just in case you were wondering: if you ever see close() return EINTR, you should probably be unhappy all over again, because it's not clear whether the file descriptor was actually closed (and perhaps reallocated on another thread) or not.