Comparison: Lockless programming with atomics in C++11 vs. mutexes and RW-locks
ArangoDB is multithreaded and able to use several CPU cores at once. Therefore, access to common data structures by these threads has to be protected from concurrent modification. ArangoDB currently uses mutexes, spinlocks and RW-locks for that. With the ongoing development of the MVCC, the number of situations where protected access is needed grows significantly. If locking is done too often, scalability is effectively limited to one core. So this test was done to estimate the costs of locking and to evaluate another solution – so-called lockless programming with atomics.
See also: “An implementation of phase-fair reader/writer locks”
We wrote a simple program to compare the different techniques. Since ArangoDB should have significantly more reader jobs and threads than writers, the test pattern chosen was 49 reader threads and one writer. The test program uses a uint64_t variable which all threads access. Most test cases increment the number (“write”); some only put the loop counter into it (giving slightly different counter output, referenced as “set”). We configured all test cases to run one million loops. A concurrent start of all workers is desired, so the spawned threads are held back by a C++11 condition_variable and released to do their load-testing work at (roughly) the same point in time. The measured time runs from releasing the condition until all worker threads have rejoined the master.
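For illustration, here is a minimal sketch of such a start gate – the names (`gate`, `go`, `worker`) are ours, not code from the actual test program: workers block on a condition_variable until the master flips a flag and notifies them, and the clock runs from that notification until the last worker has rejoined.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex gateMutex;
std::condition_variable gate;
bool go = false;

void worker(uint64_t loops) {
  {
    std::unique_lock<std::mutex> lock(gateMutex);
    gate.wait(lock, [] { return go; });  // held back until released
  }
  // ... run `loops` iterations of the read / write / set workload ...
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 50; ++i) {         // 49 readers + 1 writer
    threads.emplace_back(worker, 1000000);
  }
  auto start = std::chrono::steady_clock::now();
  {
    std::lock_guard<std::mutex> lock(gateMutex);
    go = true;
  }
  gate.notify_all();                     // release all workers at once
  for (auto& t : threads) {
    t.join();                            // measurement ends after rejoin
  }
  std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
  std::cout << "Duration: " << d.count() << "\n";
}
```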
Following are the platforms we used for the test:
- Windows (x64 Windows 8.1 with AMD Phenom II X2)
- Linux (x64 GRML Small with AMD Phenom II X2)
- Mac OS X (x64 Intel(R) Core(TM) i3 CPU 540 @ 3 GHz)
- Linux ARM (Allwinner A20 SOC on Cubie Truck)
- Linux AMD64 (Intel Core2 Duo CPU E8500 @3.1GHz)
Since the idea was to get an overview of the different behaviors on the platforms, varied hardware was chosen, without virtualization. The Windows computer was additionally booted with the GRML x64 small ISO to revalidate whether its significantly different test results were hardware-related.
Methods tested
Here is a list of the implemented algorithms with explanations:
- Read: reads the current value and increases a control counter
- Write: reads the old value, increases it and writes it back
- Set: writes a loop counter into the protected variable
0: Unlocked
To get a reference value for the cost of locking, a flat increment & read without any protection was executed – which is, of course, prone to race conditions.
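As a sketch (our own minimal version, not the actual test code), the baseline is nothing more than:

```cpp
#include <cstdint>

uint64_t value = 0;  // shared by all threads, completely unprotected

// Concurrent use of these is a data race; the final counter value is
// unreliable, but the timing serves as a reference for all other cases.
uint64_t readValue()      { return value; }
void     incrementValue() { ++value; }
```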
1: Mutexes
The classical way is to protect access to the variable with a mutex, using std::mutex. This is known to carry considerable overhead.
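A minimal sketch of what this boils down to (our names, not the actual test code):

```cpp
#include <cstdint>
#include <mutex>

std::mutex valueMutex;
uint64_t value = 0;

// Read: even a plain read takes the mutex, so no half-updated or
// stale value can be observed.
uint64_t readValue() {
  std::lock_guard<std::mutex> lock(valueMutex);
  return value;
}

// Write: read-modify-write under the same mutex.
void incrementValue() {
  std::lock_guard<std::mutex> lock(valueMutex);
  ++value;
}
```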
2: Writelock / Writelock
These “should” be cheaper than mutexes. We wanted to see what happens if you don’t differentiate between read and write, so both readers and the writer take the exclusive lock: pthread wrlock / wrlock on the *nix-es, SRWLock Exclusive / Exclusive on Windows (a sketch follows after case 3).
3: Writelock / Readlock
These “should” be cheaper than mutexes. We do pthread wrlock / rdlock on the *nix-es, SRWLock Exclusive / Shared on Windows.
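A minimal sketch of the POSIX variant (our names; on Windows the corresponding calls would be AcquireSRWLockShared / AcquireSRWLockExclusive). Test case 2 differs only in that the readers take the exclusive lock as well:

```cpp
#include <cstdint>
#include <pthread.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
uint64_t value = 0;

// Readers take the shared lock; many readers may hold it concurrently.
uint64_t readValue() {
  pthread_rwlock_rdlock(&rwlock);
  uint64_t v = value;
  pthread_rwlock_unlock(&rwlock);
  return v;
}

// The writer takes the exclusive lock.
void incrementValue() {
  pthread_rwlock_wrlock(&rwlock);
  ++value;
  pthread_rwlock_unlock(&rwlock);
}
```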
4: Atomic Read & Write
The new kid in town since C++11, also called lockless. std::atomic ensures that no races are to be expected while accessing a variable. The syntax can result in very compact code, in which you may not always be aware that the ‘a++’ you’re reading actually involves the overhead of atomic operations. The implementation strongly depends on the CPU architecture and its support.
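A minimal sketch (our own; note how little betrays that every statement is an atomic operation):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value(0);

// Both lines look like plain code, but each is a full atomic operation
// with sequentially consistent ordering by default.
uint64_t readValue()      { return value; }  // atomic load
void     incrementValue() { value++; }       // atomic read-modify-write
```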
5: Atomic Read “consume”, Atomic Set “release” for setting
With explicit setter & getter methods, the cost of the atomic operation catches the code reader’s eye more quickly. This test case doesn’t increment the integer; it simply sets the loop counter as the new value – thus the resulting counter output looks different. An atomic load retrieves the value in the readers. C++11 offers several memory-order options; this test chooses std::memory_order_consume for reading and std::memory_order_release for setting.
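A minimal sketch of this case (our names); the following cases 6–10 differ only in which two memory-order constants are passed to load() and store():

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value(0);

// Reader: the explicit load() makes the atomic access visible.
uint64_t readValue() {
  return value.load(std::memory_order_consume);
}

// Setter: stores the loop counter instead of incrementing.
void setValue(uint64_t counter) {
  value.store(counter, std::memory_order_release);
}
```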
6: Atomic Read “acquire”, Atomic Set “release” for setting
As in #5, but reading with std::memory_order_acquire and setting with std::memory_order_release.
7: Atomic Read “consume”, Atomic Set “cst – consistent” for setting
As in #5, but reading with std::memory_order_consume and setting with std::memory_order_seq_cst.
8: Atomic Read “acquire”, Atomic Set “cst – consistent” for setting
As in #5, but reading with std::memory_order_acquire and setting with std::memory_order_seq_cst.
9: Atomic Read “consume”, Atomic Set “relaxed” for setting
As in #5, but reading with std::memory_order_consume and setting with std::memory_order_relaxed.
10: Atomic Read “acquire”, Atomic Set “relaxed” for setting
As in #5, but reading with std::memory_order_acquire and setting with std::memory_order_relaxed.
11: Atomic Read “acquire”, Atomic exchange weak for writing
Again the setter & getter methods keep the cost of the atomic operation visible to the reader. This test case implements incrementing with a compare-exchange-weak retry loop, as suggested by the documentation (a sketch follows after case 12). Reading uses std::memory_order_acquire; the exchange uses std::memory_order_relaxed.
12: Atomic Read “consume”, Atomic exchange weak for writing
As in #11, but reading with std::memory_order_consume.
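A minimal sketch of such an increment loop for cases 11 and 12 (our own illustration built on std::atomic’s compare_exchange_weak; the weak variant may fail spuriously, which is why it has to be retried in a loop):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value(0);

void incrementValue() {
  uint64_t expected = value.load(std::memory_order_acquire);
  // On failure, `expected` is refreshed with the value another thread
  // wrote (or the same value after a spurious failure), and we retry.
  while (!value.compare_exchange_weak(expected, expected + 1)) {
    // retry until our increment wins the race
  }
}
```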
Test Results
Here’s a table of the times in seconds the tests took to execute:
Testcase / OS | Windows 8.1 | GRML Linux x64 | Mac OS X | Linux ARM | Linux x64 |
---|---|---|---|---|---|
#0 | 0.033 | 0.0300119 | 0.02503 | 0.18299 | 0.02895 |
#1 | 479.878 | 5.7003600 | 118.47100 | 21.29970 | 4.47721 |
#2 | 1.45997 | 4.5296900 | 142.61700 | 23.29240 | 4.72051 |
#3 | 4.70791 | 6.3525200 | 7.65026 | 23.82040 | 7.87677 |
#4 | 0.056999 | 0.0454769 | 0.03771 | 0.94990 | 0.035302 |
#5 | 0.033999 | 0.0263720 | 0.02774 | 0.58076 | 0.017803 |
#6 | 0.032999 | 0.0286388 | 0.02785 | 0.58125 | 0.017604 |
#7 | 0.060998 | 0.0528450 | 0.03783 | 0.57926 | 0.033422 |
#8 | 0.060999 | 0.0536420 | 0.03807 | 0.57851 | 0.033546 |
#9 | 0.032999 | 0.0299869 | 0.02606 | 0.55443 | 0.017258 |
#10 | 0.033999 | 0.0235679 | 0.02593 | 0.54985 | 0.017839 |
#11 | 0.058999 | 0.0534279 | 0.06688 | 0.56929 | 0.030724 |
#12 | 0.059998 | 0.0676858 | 0.07563 | 0.57466 | 0.036788 |
We can see here that Windows delivers extraordinarily bad performance with std::mutex; Dave explains why: on Windows, std::mutex appears to behave like std::recursive_mutex, built on the CRITICAL_SECTION mechanism, which carries significantly larger overhead and is to be avoided. SRW locks are sufficient for ArangoDB’s use of mutexes and are therefore preferred. On Mac OS X, however, RW-locks don’t deliver better performance, and neither do they on Linux/ARM.
Performance-wise, one can clearly state that you should use Linux if you want the biggest bang for the buck.
Implementation-wise, the conclusion is that one can’t use C++11’s std::mutex as a porting layer – depending on the requirements of the target systems, one should create one’s own wrappers around the locking mechanisms.
Atomic types deliver very good performance and, where possible, should be preferred over mutexes / locks. C++11 delivers sufficiently consistent performance for them across all platforms in our test, so one can lean on it. Using all the syntactic sugar of C++11 (i.e. the post-increment operator on an atomic uint instance) performs equally to the explicit setter/getter methods. To make the overhead visible in the source code, such instances should be given a prefix, e.g. atomic_counter.
An interesting side note is that the overhead of atomics on ARM is a little higher than on x86 – which we expected because of ARM’s weaker memory model.
Test Program
Most important – showing the means of measuring – here’s the test program. (To compile on Windows, create a default console project, paste everything in and enable the stdafx header.)
The code is available as a gist.
It ran for the tests like this:
for i in `seq 0 20`; do ./a.out 50 1000000 50 $i; done
Comments
These results are really cool!
Also, there must be a typo in the platforms used list: I don’t think there is such a thing as Windows GRML.
Sorry, I apparently cannot read. I edited my comment accordingly.
Why not compare all permutations of these CPUs and OSes? Maybe it’s just an AMD-specific issue. Maybe it’s VS. The test is incomplete.
Anyway, I don’t think that using an integer as the “operational variable” is the best idea here. Of course turning it into an atomic will be faster than using a mutex (but that’s not what a mutex is for).
I’d be more interested in seeing the performance difference of various mutex OR atomic implementations on different platforms, rather than this “oranges to apples” kind of thing.
The main intent of this comparison was to prove the time investment into embracing atomics worthwhile, rather than to prove how badly Windows performs on which CPU architecture.
I didn’t have another Windows box available at that time. Since the source is available, I would love to see your test results from other machines. Grml.org offers nice slim live CDs to do the Linux comparison.
Testing different platforms is important, so we know what’s best overall.
Tested on Windows 8.1 x64 on an “antique” overclocked i7 860, VS 2013 release build, x86 (4 cores, 8 threads, 3.6 GHz flat):
#0 0.03 // Depressing
#1 1.43 // Intriguing
#2 0.20 // More intriguing
#3 0.29
#4 0.19 // Wow
#5 0.19
Now we only need to test AMD on Linux. I don’t have one at hand right now, unfortunately.
Try rebooting the machine with a live Linux like Knoppix or GRML.
It’s a very interesting topic. I investigated this question a lot and once found an amazing article on different spinlock and RW-lock implementations: http://locklessinc.com/articles/locks/ which I strongly recommend. It contains speed tests for different implementations and different numbers of threads.
Actually, the performance of different spinlocks depends heavily on the number of concurrent processes and the CPU cores used by the application.
After a series of experiments we came to the decision that a hybrid of spinlock and sleep (or semaphores) was kind of optimal for our application. We could control the number of attempts the spinlock makes to lock the resource, as well as the sleep interval. This was the optimal decision for our case, but it’s all very dependent on the application design and the desired CPU load.
I found locklessinc.com too, after doing these tests; they offer a nice pthread.h for Windows which may also work nicely as a porting layer:
http://locklessinc.com/articles/pthreads_on_windows/
Yes, on Windows it was always a problem. I used boost condition variables (quite a fast method among the cross-platform solutions) and asm-based spinlocks for synchronizing threads back in the days when I designed software for Windows. Now I deal mainly with Linux or Unix, but I still remember all the problems with timers and synchronization on Windows, and the cross-platform decisions. We also use the ACE library: it’s cross-platform and offers locks and RW-locks that often work faster than boost, although the documentation is poor.
Just as a side note, most compilers today don’t really optimize for std::memory_order_consume and simply treat it as std::memory_order_acquire. See also http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
I ran the tests on an Intel Xeon E5-2630 running Windows 7 and got the following results:
VS 2012:
#0 Duration: 0.012001
#1 Duration: 205.906
#2 Duration: 1.25512
#3 Duration: 1.46515
#4 Duration: 1.63016
#5 Duration: 1.56016
#6 Duration: 1.79218
#7 Duration: 1.78818
#8 Duration: 1.77418
#9 Duration: 1.73417
#10 Duration: 1.64416
#11 Duration: 1.66317
#12 Duration: 1.78618
VS 2015:
#0 Duration: 0.00999999
#1 Duration: 2.52
#2 Duration: 1
#3 Duration: 1.17
#4 Duration: 0.03
#5 Duration: 0.00999999
#6 Duration: 0.00999999
#7 Duration: 0.02
#8 Duration: 0.03
#9 Duration: 0.00999999
#10 Duration: 0.00999999
#11 Duration: 0.02
#12 Duration: 0.02
Note that std::mutex is “fixed” and overall speed is better in 2015.
Conclusion: Upgrade to VS2015 NOW!
Thanks, this looks indeed promising!
What units are the table in? Microseconds?
AFAIR it was seconds.
The gist with the source probably shows it: https://gist.github.com/13abylon/523685ef6af5bde24dc3