How… interesting.

An open source program (ogg123) has been crashing repeatedly on my new laptop. The source code hasn’t changed in years. The stack trace always points to __lll_unlock_elision() in glibc, which seems kind of scary. Searches on the web all seem to point back to problems with Haswell CPUs that have broken TSX instructions and are running out-of-date microcode.

I spent a bit of time panicking and trying to figure out why my microcode was wrong, but that seems not to be the problem. As far as I can tell I have a CPU that’s new enough that TSX instructions work, at least well enough that glibc wants to use its lock elision code on pthreads mutexes.

The program that’s crashing on my new laptop appears to have a bug in its mutex handling. Just before exiting, it tries to unlock a mutex, regardless of whether that mutex is actually locked at the time. I guess that’s incorrect, though pthreads must have tolerated this sort of thing in the past.

There’s a quick and dirty workaround: if I add a gratuitous call to pthread_mutex_trylock before the existing call to pthread_mutex_unlock, the program now works fine. pthread_mutex_trylock is returning 0, so it succeeded at locking the mutex before unlocking it again. This confirms the original call to unlock was wrong, trying to unlock an already-unlocked mutex.
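
A minimal sketch of that workaround, for illustration only: the helper name unlock_at_exit is hypothetical and not taken from the ogg123 source, and it assumes no other thread can still be holding the mutex when it runs.

```c
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper (not from ogg123): release a mutex at exit
 * without knowing whether it is currently held.
 * Assumes no other thread can hold the mutex at this point.
 */
int unlock_at_exit(pthread_mutex_t *m)
{
    int rc = pthread_mutex_trylock(m);

    /* 0: the mutex was free and is now ours. EBUSY: it was already locked. */
    if (rc != 0 && rc != EBUSY)
        return rc;              /* unexpected error; leave the mutex alone */

    pthread_mutex_unlock(m);    /* the mutex is locked here either way */
    return rc;
}
```

A return value of 0 means the mutex was not locked when the helper was called, which is exactly the case where the bare unlock would have crashed.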

What’s really interesting to me about this bug is that it looks like a glibc feature is turning working binaries into broken ones. The exact same code runs fine on every other system; it’s only when glibc turns on lock elision that the same program binary starts to crash.

As far as I can tell, there isn’t an environment variable or any similar way to disable lock elision on a particular program. So the only way to unbreak the newly broken program is to dig into the source code and fix the locking so that it never unlocks a mutex it doesn’t hold.

I don’t think it’s a great idea for glibc to break working code in this way, but I’m not feeling ambitious enough to pick a fight with them over it. I guess the best I can hope for is that this post gathers enough Google rank that someone out there will waste less time than I did trying to solve a similar problem.

// Crashes on Skylake CPUs with lock elision enabled

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void * worker_thread(void * arg)
{
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main()
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);
    pthread_join(tid, NULL);
    pthread_mutex_unlock(&lock); // BUG! dies here in __lll_unlock_elision()
    return 0;
}

 

edit, Oct. 2017: Carlos O'Donell writes to add:

Elision in glibc for default pthread mutexes is now controllable via runtime tunables, e.g. GLIBC_TUNABLES=glibc.elision.enable=0, and all processes are opted out of elision unless they enable it explicitly. This change should be going out with glibc 2.27 in February 2018.