When Exceptions Collide: Managing Errors in ArangoDB
Managing Exception Collisions | ArangoDB BlogThis is a story of an excursion to the bottom of a deep rabbit hole, where I discovered a foot gun in `gcc`’s `libgcc`. The investigation has cost me several days and I hope that by writing this up I can entertain others and save them the journey.
TL;DR
If a C++ application is compiled with GCC on Linux and statically linked against a non-GLibC C-library (like `libmusl`), then there is a danger of a data race which leads to a busy loop happening after `main()` and all static destructors have finished. The race happens, if the application does not use `pthread_cancel` explicitly and if the very first exception which is thrown in the processes’ life is thrown in two different threads at the same time.
Background
This is about a problem which came up during CI-testing a large pull request for ArangoDB. Although all tests were green, every once in a while an `arangod` process would not terminate properly. The test would eventually time out and Jenkins would show me a bad red ball and deliver a core from a SIGABORT.
Before I tell you about the core, let me explain the setting. ArangoDB is a multi-threaded application written in C++. We build the executable on Linux in a Docker container using Alpine Linux and `libmusl` as a C-library, because we can then build a single self-contained statically linked executable which runs on basically all variants of Linux (which run on a `x86_64` architecture). This simplifies packaging and greatly helps our support team, since they only have to deal with a single binary for each version. The binary even runs on Windows using the
“Windows subsystem for Linux” (WSL)! See details about this in this article.
The Problem
So what did the core show? There was only a single thread left, and the stack trace was this:
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f4d1442cc57 in __deregister_frame_info_bases (begin=0x7f4d14adaf60)
at /home/buildozer/aports/main/gcc/src/gcc-8.3.0/libgcc/unwind-dw2-fde.c:222
222 /home/buildozer/aports/main/gcc/src/gcc-8.3.0/libgcc/unwind-dw2-fde.c: No such file or directory.
(gdb) bt
#0 0x00007f4d1442cc57 in __deregister_frame_info_bases (begin=0x7f4d14adaf60)
at /home/buildozer/aports/main/gcc/src/gcc-8.3.0/libgcc/unwind-dw2-fde.c:222
#1 __deregister_frame_info_bases (begin=0x7f4d14adaf60)
at /home/buildozer/aports/main/gcc/src/gcc-8.3.0/libgcc/unwind-dw2-fde.c:201
#2 0x00007f4d1199b6ce in __do_global_dtors_aux ()
#3 0x0000000000000027 in ?? ()
#4 0x00007f4d1199e192 in runServer (argc=0, argv=0x0, context=...)
at /work/ArangoDB/arangod/RestServer/arangod.cpp:284
#5 0x00007ffddc78f898 in ?? ()
#6 0x00007f4d14447af5 in _fini ()
#7 0x0000000000000000 in ?? ()
Note that `main()` has already terminated! We are in the middle of `__do_global_dtors_aux`, which is a function GCC creates to run all destructors for global objects. As it turned out later, all these destructors have already completed and we are already at the stage in which the C++ runtime does some cleanup and shutdown on its own. The process was aborted by the CI system, but at this stage, I was completely puzzled as to what the process might have been doing to prevent the proper shutdown.
And: How could it possibly be our fault or our PR’s fault???
The effect did not happen in every test run, but it was happening often enough to be annoying. And so my journey started…
First findings
Inspecting the core at first did not reveal much, since I did not know my way around the C++ runtime code in `libgcc`. Also, I could no longer single-step (since it was a core dump) and, obviously, I could not reproduce the problem in a debugger. However, I relatively quickly found a short loop in the assembler code and could look at the data structures it manipulates, to prove in the core that a busy loop was happening (if you do not know `x86_64` assembler, just ignore this part and believe me that there is a busy loop):
0x7f4d1442cc46 <__deregister+103> lea 0x28(%rbx),%rax
0x7f4d1442cc4a <__deregister+107> mov 0x28(%rbx),%rbx
0x7f4d1442cc4e <__deregister+111> test %rbx,%rbx
0x7f4d1442cc51 <__deregister+114> je 0x7f4d1442cc82 <__deregister+163>
0x7f4d1442cc53 <__deregister+116> mov 0x18(%rbx),%rdi
0x7f4d1442cc57 <__deregister+120> testb $0x1,0x20(%rbx)
0x7f4d1442cc5b <__deregister+124> jne 0x7f4d1442cc6b <__deregister+140>
0x7f4d1442cc5d <__deregister+126> cmp %rdi,%rbp
0x7f4d1442cc60 <__deregister+129> jne 0x7f4d1442cc46 <__deregister+103>
(I shortened `__deregister_frame_info_bases` to `__deregister` for the
sake of brevity in the presentation!)
With `%rbx` being `0x7f4d1568e460` and the memory at this address looking like this:
0x7f4d1568e460 : 0x00007f4d11995000 0x0000000000000000
0x7f4d1568e470 <object.6149+16>: 0x0000000000000000 0x00007f4cd6f00b00
0x7f4d1568e480 <object.6149+32>: 0x000000002190d0d9 0x00007f4d1568e460
</object.6149+32></object.6149+16>
it is clear that there is a busy loop, since the jump to `__deregister+163` never happens. Looking at the C code in the GCC-library `libgcc`, we were in a loop which searches a singly linked list of objects of type `struct object` (whatever that is!):
struct object
{
void *pc_begin;
void *tbase;
void *dbase;
union {
const struct dwarf_fde *single;
struct dwarf_fde **array;
struct fde_vector *sort;
} u;
union {
struct {
unsigned long sorted : 1;
unsigned long from_array : 1;
unsigned long mixed_encoding : 1;
unsigned long encoding : 8;
unsigned long count : 21;
} b;
size_t i;
} s;
struct object *next;
};
(comments and dead code removed for brevity!)
On 64-bit machines, the object contains exactly six 64-bit words and the `next` pointer happens to point to the object itself, thus creating a circular data structure!
This finding took me quite some time, since I had to dig out old assembler knowledge and carefully look around in the core to find this. The theory was confirmed by our monitoring on the CI machines, which indeed showed that towards the end of the test run a single core was running at 100% CPU.
At this stage, I had no idea about the data structures in `libgcc`. However, in the C-code I saw two singly linked lists `unseen_objects` and `seen_objects`, which were static in the file `libgcc/unwind-dw2-fde.c`, so all the code dealing with them is in this module!
The only place modifying the `seen_objects` list is further down in the same file in the function `_Unwind_Find_FDE`, where objects in the `unseen_objects` list are processed and moved to the `seen_objects` list. As far as I could see, the code was completely correct (although I notoriously get code about linked lists wrong!). Furthermore, it was neatly protected by a mutex, as can be seen here. Surely, this code:
init_object_mutex_once ();
__gthread_mutex_lock (&object_mutex);
would protect the data structures against concurrent access by different threads. How could the list become circular???
Caught in the act
I had just proven that the code is correct and that the effect I saw in the core could never happen!
Out of sheer desperation I thought to myself that I had to actually catch it in the act of ruining the data structure. Furthermore, I was interested which piece of our code would trigger such an event.
The next thing I did was to run the tests in their native environment (many tests at the same time, high load on the machines), and put a probe into the unmodified executable using the Linux `perf` tool.
Much to my surprise, this was actually quite simple, I used the following commands to set up the probe and record the events:
sudo perf probe -x build/bin/arangod -a "unwind=unwind-dw2-fde.c:1072 ob p"
sudo perf record -e "probe_arangod*" -aR --call-graph dwarf
(hit Control-C after the recording)
sudo perf script > perf.history
This should essentially create an event every time the `seen_objects` list was modified. And in the case of an event I would see not only the complete call trace but also the value of the local variables `ob` and `p`. If I can catch it with this enabled, I should at least see which piece of our code triggered the bad behavior!
After a few tries the shutdown busy loop actually happened! I was very curious as to what I would find, and indeed, I now had a new core with the actual address of the loopy list and an event log of `perf` which should show the bad modification! And what I found was this (this is exactly happening here):
MaintenanceWork 8985 [025] 4713776.942241: probe_arangod:unwind: (7f4d1442cd77) ob=0x7f4d1568e460 p=0x7f4d165b4590
2c93d77 _Unwind_Find_registered_FDE (inlined)
2c93d77 _Unwind_Find_FDE (/work/ArangoDB/build/bin/arangod)
2c91474 uw_frame_state_for (/work/ArangoDB/build/bin/arangod)
2c92065 uw_init_context_1 (/work/ArangoDB/build/bin/arangod)
2c924a9 _Unwind_RaiseException (/work/ArangoDB/build/bin/arangod)
2c3ea25 __cxa_throw (/work/ArangoDB/build/bin/arangod)
520c33 replicationSynchronize (/work/ArangoDB/build/bin/arangod)
5251d3 arangodb::maintenance::SynchronizeShard::first (/work/ArangoDB/build/bin/arangod)
ffffffffffffffff [unknown] ([unknown])
MaintenanceWork 8981 [012] 4713776.992994: probe_arangod:unwind: (7f4d1442cd77) ob=0x7f4d1568e460 p=0x7f4d1568e488
2c93d77 _Unwind_Find_registered_FDE (inlined)
2c93d77 _Unwind_Find_FDE (/work/ArangoDB/build/bin/arangod)
2c91474 uw_frame_state_for (/work/ArangoDB/build/bin/arangod)
2c92065 uw_init_context_1 (/work/ArangoDB/build/bin/arangod)
2c924a9 _Unwind_RaiseException (/work/ArangoDB/build/bin/arangod)
2c3ea25 __cxa_throw (/work/ArangoDB/build/bin/arangod)
520c33 replicationSynchronize (/work/ArangoDB/build/bin/arangod)
5251d3 arangodb::maintenance::SynchronizeShard::first (/work/ArangoDB/build/bin/arangod)
ffffffffffffffff [unknown] ([unknown])
So indeed, the same exception was thrown at approximately the same time (50ms difference, which is of course an eternity!), but on a busy system with many processes running, a thread can be suspended for some time before it gets a time slice again. The output of the local variable `ob` actually shows that the same object was inserted into the singly linked list twice. At the time of the second insert the list is ruined and made loopy, because of the following. Assume the list looks like this:
A -next-> B -next-> C -next-> 0
and the same object B is inserted again. Because of some descending sorting by addresses which is going on, the position in which B is inserted is directly behind B itself. Normally, to insert a D there, the `next` pointer of D is changed to point to C and the `next` pointer of B is changed to point to D, yielding this:
A -next-> B -next-> D -next> C -next-> 0
But if D and B are identical, the result will be:
A -next-> B <-next-> B
with B’s `next` pointer pointing to B itself. So what must have happened is that two threads were trying to do this manipulation at the same time. One must have grabbed the object from the `unseen_objects` list, but then been suspended. Then the other grabbed the object, too, and subsequently both inserted the same object into the `seen_objects` list.
However: How could this theory possibly be true, if there is a mutex around the whole code???
What goes wrong?
It turns out that the mutex is not actually used! However, this is cleverly hidden by C preprocessor magic. The idea is that the mutex in this place is not necessary, if the whole program is single-threaded. So, it sounds like a good idea to improve performance by not using the mutex in single threaded programs.
However, how do you detect that the program is single-threaded?
This is done very differently on different operating systems and even C-library versions. The basic idea is that if the `pthreads` library is present, then we assume that the program is multi-threaded and have to use the mutex. The gory details can be found here and the following 150 lines of code. It all comes down to the function
`__gthread_active_p`. The relevant implementation for us is here, this is the not FreeBSD and not Solaris version and looks like this:
#ifdef __GLIBC__
__gthrw2(__gthrw_(__pthread_key_create),
__pthread_key_create,
pthread_key_create)
# define GTHR_ACTIVE_PROXY __gthrw_(__pthread_key_create)
#elif defined (__BIONIC__)
# define GTHR_ACTIVE_PROXY __gthrw_(pthread_create)
#else
# define GTHR_ACTIVE_PROXY __gthrw_(pthread_cancel)
#endif
static inline int
__gthread_active_p (void)
{
static void *const __gthread_active_ptr
= __extension__ (void *) & GTHR_ACTIVE_PROXY;
return __gthread_active_ptr != 0;
}
It uses so-called “weak references” which are resolved at link time. In actual fact, the address of a library function is loaded and the linker replaces it by 0 if the library function is not in the executable. On GLibC systems, the (internal) function `__pthread_key_create` is used, which seems to be present if and only if `pthread_create` is. This seems to be reliable, since on GLibC the problem does not occur. On Android with the “Bionic” C-library, `pthread_create` is used directly, I do not know how reliably this works, since I did not try it. There is a comment above this piece of code saying that many libraries wrap `pthread_create` and thus the symbol could be there without `libpthreads` actually being linked. It seems that therefore, in the last case, which happens to be active with `libmusl` on Alpine Linux, the function `pthread_cancel` is used.
This actually works for non-statically linked executables! I assume that this is the case since if `pthread_create` is used, the whole shared library `libc.musl-x86_64` is linked and this includes all `pthread_*` functions. However, when I link statically with `-static`, the linker manages to put `pthread_create` in but leave out `pthread_cancel`, since I am not explicitly using it. C++ programs will indeed rarely use `pthread_cancel`, since this functionality is not covered by the standard template library (STL)!
So therefore, what actually happens is that on Alpine Linux, with `libmusl` as C-library, in a program which uses multiple threads but not `pthread_cancel` explicitly, and is statically linked, this mutex is not being used!
When exactly does it happen?
So when is `_Unwind_Find_FDE` actually called? I still have not understood all the fine details. However, what I have understood is this: When an exception is thrown, the C++ runtime has to figure out how to unroll the stack, execute all cleanup destructors and `catch` sections, and resume the execution at the right place. For performance reasons, nothing to help with this is done at runtime, on the grounds that exception handling should be “zero overhead” as long as no exceptions are actually thrown. However, the compiler does a lot of preparations at compile time. It basically creates for each “object” (whatever this actually is!), a list of “frame descriptor entries” (FDEs), which basically contain the information where in the binary code each frame ends and begins. For example, the C++ program:
{
C c;
try {
callSomething();
} catch (std::exception const& e) {
cleanup();
}
}
already contains multiple frames, I guess one for the `try` section, one for the `catch` section and one for the scope containing the local variable `c`. If an exception is thrown deep in `callSomething()`, all the C++ runtime has is the return address which points directly behind the call to `callSomething()`. It has to figure out that this is in the frame for the `try` and that now the `catch` frame has to be executed (or not, if the type does not match) and then do the cleanup for `c`.
In any case, all these data structures are prepared by the compiler, but also manipulated at runtime, and they reside in the `unseen_objects` and `seen_objects` lists. In all cases I have seen, there is a single object in `unseen_objects` at the beginning of `main()`, probably for the whole of the static executable, which contains lots of FDEs.
When the first exception is thrown in the life of the process, `_Unwind_Find_FDE` is called, and it will discover the single entry in `unseen_objects`. At this time, it is processed, the FDEs are potentially sorted, and the object is then added to the `seen_objects` list. I think this is some lazy initialization pattern, potentially squeezing a tiny bit of improvement for the process startup time.
In any case, if the mutex protecting all this is absent, and the first exception happening in the life of the process actually happens exactly at the same time in two threads, then we have a data race.
If things go wrong, the same (singleton) object is inserted twice into `seen_objects` and we have a shutdown busy loop. The process seems to run fine until then, although there might be situations in which the runtime data structures are destroyed, if there is more than one object in the list. However, I have not seen this happening in the wild.
To summarize:
If:
- we are on Linux,
- we are not using GLibC,
- we are not on Android with the Bionic C-library,
- we statically link the executable,
- we do not explicitly use `pthread_cancel`,
- the first exception that happens, happens in two threads exactly at the same time
- in one thread the execution is suspended at an unlucky time during the list manipulation in `_Unwind_Find_FDE`, one object,
and all of this happens together, we get a shutdown busy loop after the termination of the `main()` function.
Example program to show the bug
Here is a sample program exposing the problem (store this as `exceptioncollision.cpp` in the current directory):
#include <thread>
#include <atomic>
#include <chrono>
std::atomic<int> letsgo{0};
void waiter() {
size_t count = 0;
while (letsgo == 0) {
++count;
}
try {
throw 42;
} catch (int const& s) {
}
}
int main(int, char*[]) {
#ifdef REPAIR
try { throw 42; } catch (int const& i) {}
#endif
std::thread t1(waiter);
std::thread t2(waiter);
std::this_thread::sleep_for(std::chrono::milliseconds(10));
letsgo = 1;
t1.join();
t2.join();
return 0;
}
You can run Alpine Linux in a Docker container like this:
docker run -it --privileged -v $(pwd)/work:/work alpine:3.10 sh
Then install `g++` and `gdb`:
apk add g++ gdb
And compile the program as follows:
cd /work
g++ exceptioncollision.cpp -o exceptioncollision -O0 -Wall -std=c++14 -lpthread -static
Finally (even outside the Docker container), do:
while true ; ./exceptioncollision ; date ; done
and you will soon see the busy loop.
Use
cd /work
g++ exceptioncollision.cpp -o exceptioncollision -O0 -Wall -std=c++14 -lpthread -static -DREPAIR=1
to enable the “fix” described below.
I have submitted a bug report against `libgcc` to the `gcc` team here. This was at first closed quickly on the grounds that it is a problem of libmusl
. However, when I subsequently reported the problem to them on their mailing list, they quickly replied that the problem is actually in libgcc
after all, and the first bug report was reopened. Let’s see how this issue is resolved between the two teams
How to circumvent the problem
Once understood, there are different ways to circumvent the problem. The first obvious one seems to be to use `pthread_cancel` explicitly. This will then activate lots of thread-related functionality including the particular mutex we have been talking about. Unfortunately, at the time of this writing I am running in a different problem with this. An example program exposing the problem is here (store this as `exceptionbang.cpp` in the current directory):
#include <pthread.h>
#ifdef REPAIR
void* g(void *p) {
return p;
}
void f() {
pthread_t t;
pthread_create(&t, nullptr, g, nullptr);
pthread_cancel(t);
pthread_join(t, nullptr);
}
#endif
int main(int argc, char*[]) {
#ifdef REPAIR
if (argc == -1) { f(); }
#endif
try { throw 42; } catch(int const& i) {};
return 0;
}
Simply compile on Alpine Linux like this:
cd /work
g++ exceptionbang.cpp -o exceptionbang -O0 -Wall -std=c++14 -lpthread -static -DREPAIR=1
It will immediately crash. It turns out that other crucial thread related symbols must additionally be inserted to create strong references. One would at least need `pthread_once` additionally and then it seems to work. So this approach is really a temporary solution but no permament one.
The next idea is to simply throw an exception whilst still single threaded. This seems to work beautifully, this PR implements it for the `devel` branch of `ArangoDB`.
That is, a single line fix at the beginning of `main()` like this:
try { throw 42; } catch(int const&) {};
solves the problem. However, this is also no permanent solution, since the C++ runtime would continue to treat our program as single-threaded and who knows what other problems arise down the line.
What is a possible fix?
From the perspective of a developer of a multi-threaded application the optimization to skip a mutex in the single threaded case seems to be a bit over the top, considering the complication this causes. So one way would be to simply use the mutex in all cases, at least in this place, where the mutex never seems to be a serious (or even measurable) bottleneck.
However, the whole `gthreads` business is probably also used elsewhere, where performance is more important. Nevertheless, most applications nowadays use more than a single thread and so the easiest would be to switch off the whole optimization for single threaded applications, at least when running on libmusl
.
The second possibility would be to do the initialization of the FDE lists not in a lazy fashion, but eagerly at process start. This also seems to be a simplification of the code. It could, however, make the startup of a process slower in some cases, I guess. Maybe it is an important optimization to initialize only when the first exception actually happens, I do not know.
Which leaves the third idea of getting the recognition of single-threadedness right, even on `libmusl` and C++ programs which do not use `pthread_cancel` explicitly. This is probably the way to go. I am looking forward to hearing what the `gcc` developers suggest.
Lessons learned
In the course of the last few days I have learned a lot about the implementation of exceptions in C++ and `gcc`, about FDEs, code that runs before and after `main()`, about linking, weak references, different C-libraries, stack unwinding, `gdb`, assembler code and about the Linux `perf` tools. It has been fun and enlightening, but it also was a difficult investigation.
As lessons I take away:
- Do not over-optimize at the cost of unnecessary complexity.
- C++ runtime and exception handling is complicated.
- C++ exceptions can be quite expensive when actually thrown, since the optimization is for the common case that no exception is thrown.
- C-code can be hard to understand, in particular in the presence of lots of platform-specific special cases and preprocessor magic.
- The Linux `perf` tools are an incredible asset for debugging.
- Open source makes such an investigation possible in the first place.
- The teams developing these fundamental tools like
gcc
andlibmusl
are very responsive and actually help when problems arise.
Once again, I hope my findings will save you the journey of investigating the same problem for days. If you want to know more about multi-model or ArangoDB, I recommend reading the white paper I wrote about the topic. You can download it here.
2 Comments
Leave a Comment
Get the latest tutorials, blog posts and news:
[…] View Reddit by grmpf101 – View Source […]
This bug has been seen elsewhere and I’ve written it up here:
https://bugs.launchpad.net/gcc-arm-embedded/+bug/1905459
It will push the fix priority if you would log in above and not the bug also affects you.
Thanks,
Best Regards, Dave