Comparison: Lockless programming with atomics in C++11 vs. mutexes and RW-locks
ArangoDB is multithreaded and able to use several CPU cores at once. Therefore, access to common data structures by these threads has to be protected from concurrent modification. ArangoDB currently uses mutexes, spinlocks and RW-locks for that. With the ongoing development of the MVCC, the number of situations where protected access is needed grows significantly. If locking is done too often, scalability is effectively limited to one core. So this test was done to estimate the costs of locking and to evaluate another solution – so-called lockless programming with atomics.
See also: “An implementation of phase-fair reader/writer locks”
We wrote a simple program to compare the different techniques. Since ArangoDB should have significantly more reader jobs and threads than writers, the test pattern chosen was 49 reader threads and one writer. The test program uses a uint64_t variable which all threads access. Most test cases increment the number (“write”); some only put the loop counter into it (giving slightly different counter output, referenced as “set”). We configured all test cases to run one million loops. A concurrent start of all workers is desired, so the spawned threads are held back by a C++11 condition_variable and released to do their load-testing work at (roughly) the same point in time. The measured time runs from releasing the condition until all worker threads have rejoined the master.
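For illustration, here is a minimal sketch of such a start gate – the names (`gate`, `go`, `worker`) are ours, not code from the actual test program: workers block on a condition_variable until the master flips a flag and notifies them, and the clock runs from that notification until the last worker has rejoined.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex gateMutex;
std::condition_variable gate;
bool go = false;

void worker(uint64_t loops) {
  {
    std::unique_lock<std::mutex> lock(gateMutex);
    gate.wait(lock, [] { return go; });  // held back until released
  }
  // ... run `loops` iterations of the read / write / set workload ...
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 50; ++i) {         // 49 readers + 1 writer
    threads.emplace_back(worker, 1000000);
  }
  auto start = std::chrono::steady_clock::now();
  {
    std::lock_guard<std::mutex> lock(gateMutex);
    go = true;
  }
  gate.notify_all();                     // release all workers at once
  for (auto& t : threads) {
    t.join();                            // measurement ends after rejoin
  }
  std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
  std::cout << "Duration: " << d.count() << "\n";
}
```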
Following are the platforms we used for the test:
- Windows (x64 Windows 8.1 with AMD Phenom II X2)
- Linux (x64 GRML Small with AMD Phenom II X2)
- Mac OS X (x64 Intel(R) Core(TM) i3 CPU 540 @ 3 GHz)
- Linux ARM (Allwinner A20 SOC on Cubie Truck)
- Linux AMD64 (Intel Core2 Duo CPU E8500 @3.1GHz)
Since the idea was to get an overview of the different behaviors on the platforms, varied hardware was chosen, without virtualization. The Windows computer was additionally booted with the GRML x64 small ISO to revalidate whether its significantly different test results were hardware-related.
Methods tested
Here is a list of the implemented algorithms with explanations:
- Read: reads the current value and increases a control counter
- Write: reads the old value, increases it and writes it back
- Set: writes a loop counter into the protected variable
0: Unlocked
To get a reference value for the cost of locking, a flat increment & read without any protection was executed – which is, of course, prone to race conditions.
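As a sketch (our own minimal version, not the actual test code), the baseline is nothing more than:

```cpp
#include <cstdint>

uint64_t value = 0;  // shared by all threads, completely unprotected

// Concurrent use of these is a data race; the final counter value is
// unreliable, but the timing serves as a reference for all other cases.
uint64_t readValue()      { return value; }
void     incrementValue() { ++value; }
```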
1: Mutexes
The classical way is to protect access to the variable with a mutex, using std::mutex. This is known to carry considerable overhead.
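A minimal sketch of what this boils down to (our names, not the actual test code):

```cpp
#include <cstdint>
#include <mutex>

std::mutex valueMutex;
uint64_t value = 0;

// Read: even a plain read takes the mutex, so no half-updated or
// stale value can be observed.
uint64_t readValue() {
  std::lock_guard<std::mutex> lock(valueMutex);
  return value;
}

// Write: read-modify-write under the same mutex.
void incrementValue() {
  std::lock_guard<std::mutex> lock(valueMutex);
  ++value;
}
```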
2: Writelock / Writelock
These “should” be cheaper than mutexes. We wanted to see what happens if you don’t differentiate between read and write, so both readers and the writer take the exclusive lock: pthread wrlock / wrlock on the *nix-es, SRWLock Exclusive / Exclusive on Windows (a sketch follows after case 3).
3: Writelock / Readlock
These “should” be cheaper than mutexes. We do pthread wrlock / rdlock on the *nix-es, SRWLock Exclusive / Shared on Windows.
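A minimal sketch of the POSIX variant (our names; on Windows the corresponding calls would be AcquireSRWLockShared / AcquireSRWLockExclusive). Test case 2 differs only in that the readers take the exclusive lock as well:

```cpp
#include <cstdint>
#include <pthread.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
uint64_t value = 0;

// Readers take the shared lock; many readers may hold it concurrently.
uint64_t readValue() {
  pthread_rwlock_rdlock(&rwlock);
  uint64_t v = value;
  pthread_rwlock_unlock(&rwlock);
  return v;
}

// The writer takes the exclusive lock.
void incrementValue() {
  pthread_rwlock_wrlock(&rwlock);
  ++value;
  pthread_rwlock_unlock(&rwlock);
}
```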
4: Atomic Read & Write
The new kid in town since C++11, also called lockless. std::atomic ensures that no races are to be expected while accessing a variable. The syntax can result in very compact code, in which you may not always be aware that the ‘a++’ you’re reading actually involves the overhead of atomic operations. The implementation strongly depends on the CPU architecture and its support.
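A minimal sketch (our own; note how little betrays that every statement is an atomic operation):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value(0);

// Both lines look like plain code, but each is a full atomic operation
// with sequentially consistent ordering by default.
uint64_t readValue()      { return value; }  // atomic load
void     incrementValue() { value++; }       // atomic read-modify-write
```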
5: Atomic Read “consume”, Atomic Set “release” for setting
With explicit setter & getter methods, the cost of the atomic operation catches the code reader’s eye more quickly. This test case doesn’t increment the integer; it simply sets the loop counter as the new value – thus the resulting counter output looks different. An atomic load retrieves the value in the readers. C++11 offers several memory-order options; this test chooses std::memory_order_consume for reading and std::memory_order_release for setting.
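A minimal sketch of this case (our names); the following cases 6–10 differ only in which two memory-order constants are passed to load() and store():

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value(0);

// Reader: the explicit load() makes the atomic access visible.
uint64_t readValue() {
  return value.load(std::memory_order_consume);
}

// Setter: stores the loop counter instead of incrementing.
void setValue(uint64_t counter) {
  value.store(counter, std::memory_order_release);
}
```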
6: Atomic Read “acquire”, Atomic Set “release” for setting
As in #5, but reading with std::memory_order_acquire and setting with std::memory_order_release.
7: Atomic Read “consume”, Atomic Set “cst – consistent” for setting
As in #5, but reading with std::memory_order_consume and setting with std::memory_order_seq_cst.
8: Atomic Read “acquire”, Atomic Set “cst – consistent” for setting
As in #5, but reading with std::memory_order_acquire and setting with std::memory_order_seq_cst.
9: Atomic Read “consume”, Atomic Set “relaxed” for setting
As in #5, but reading with std::memory_order_consume and setting with std::memory_order_relaxed.
10: Atomic Read “acquire”, Atomic Set “relaxed” for setting
As in #5, but reading with std::memory_order_acquire and setting with std::memory_order_relaxed.
11: Atomic Read “acquire”, Atomic exchange weak for writing
Again the setter & getter methods keep the cost of the atomic operation visible to the reader. This test case implements incrementing with a compare-exchange-weak retry loop, as suggested by the documentation (a sketch follows after case 12). Reading uses std::memory_order_acquire; the exchange uses std::memory_order_relaxed.
12: Atomic Read “consume”, Atomic exchange weak for writing
As in #11, but reading with std::memory_order_consume.
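A minimal sketch of such an increment loop for cases 11 and 12 (our own illustration built on std::atomic’s compare_exchange_weak; the weak variant may fail spuriously, which is why it has to be retried in a loop):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> value(0);

void incrementValue() {
  uint64_t expected = value.load(std::memory_order_acquire);
  // On failure, `expected` is refreshed with the value another thread
  // wrote (or the same value after a spurious failure), and we retry.
  while (!value.compare_exchange_weak(expected, expected + 1)) {
    // retry until our increment wins the race
  }
}
```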
Test Results
Here’s a table of the times in seconds the tests took to execute:
Testcase / OS | Windows 8.1 | GRML Linux x64 | Mac OS X | Linux ARM | Linux x64 |
---|---|---|---|---|---|
#0 | 0.033 | 0.0300119 | 0.02503 | 0.18299 | 0.02895 |
#1 | 479.878 | 5.7003600 | 118.47100 | 21.29970 | 4.47721 |
#2 | 1.45997 | 4.5296900 | 142.61700 | 23.29240 | 4.72051 |
#3 | 4.70791 | 6.3525200 | 7.65026 | 23.82040 | 7.87677 |
#4 | 0.056999 | 0.0454769 | 0.03771 | 0.94990 | 0.035302 |
#5 | 0.033999 | 0.0263720 | 0.02774 | 0.58076 | 0.017803 |
#6 | 0.032999 | 0.0286388 | 0.02785 | 0.58125 | 0.017604 |
#7 | 0.060998 | 0.0528450 | 0.03783 | 0.57926 | 0.033422 |
#8 | 0.060999 | 0.0536420 | 0.03807 | 0.57851 | 0.033546 |
#9 | 0.032999 | 0.0299869 | 0.02606 | 0.55443 | 0.017258 |
#10 | 0.033999 | 0.0235679 | 0.02593 | 0.54985 | 0.017839 |
#11 | 0.058999 | 0.0534279 | 0.06688 | 0.56929 | 0.030724 |
#12 | 0.059998 | 0.0676858 | 0.07563 | 0.57466 | 0.036788 |
We can see here that Windows delivers extraordinarily bad performance with std::mutex; Dave explains why: on Windows, std::mutex appears to behave like std::recursive_mutex, built on the CRITICAL_SECTION mechanism, which carries significantly larger overhead and is to be avoided. SRW locks are sufficient for ArangoDB’s use of mutexes and are therefore preferred. On Mac OS X, however, RW-locks don’t deliver better performance, and neither do they on Linux/ARM.
Performance-wise, one can clearly state that you should use Linux if you want the biggest bang for the buck.
Implementation-wise, the conclusion is that one can’t use C++11’s std::mutex as a porting layer – depending on the requirements of the target systems, one should create one’s own wrappers around the locking mechanisms.
Atomic types deliver very good performance and, where possible, should be preferred over mutexes / locks. C++11 delivers sufficiently consistent performance for them across all platforms in our test, so one can lean on it. Using all the syntactic sugar of C++11 (i.e. the post-increment operator on an atomic uint instance) performs equally to the explicit setter/getter methods. To make the overhead visible in the source code, such instances should be given a prefix, e.g. atomic_counter.
An interesting side note is that the overhead of atomics on ARM is a little higher than on x86 – which we expected because of ARM’s weaker memory model.
Test Program
Most important – showing the means of measuring – here’s the test program. (To compile on Windows, create a default console project, paste everything in and enable the stdafx header.)
The code is available as a gist.
It ran for the tests like this:
for i in `seq 0 20`; do ./a.out 50 1000000 50 $i; done
Comments
These results are really cool!
Also, there must be a typo in the platforms used list: I don’t think there is such a thing as Windows GRML.
Sorry, I apparently cannot read. I edited my comment accordingly.
Why not compare all permutations of these CPUs and OSes? Maybe it’s just an AMD-specific issue. Maybe it’s VS. The test is incomplete.
Anyway, I don’t think that using an integer as the “operational variable” is the best idea here. Of course turning it into an atomic will be faster than using a mutex (but that’s not what a mutex is for).
I’d be more interested in seeing the performance difference of various mutex OR atomic implementations on different platforms, rather than this “oranges to apples” kind of thing.
The main intent of this comparison was to prove the time investment into embracing atomics worthwhile, rather than to prove how badly Windows performs on which CPU architecture.
I didn’t have another Windows box available at that time. Since the source is available, I would love to see your test results from other machines. Grml.org offers nice slim live CDs to do the Linux comparison.
Testing different platforms is important, so we know what’s best overall.
Tested on Windows 8.1 x64 on an “antique” overclocked i7 860, VS 2013 release build, x86 (4 cores, 8 threads, 3.6 GHz flat):
#0 0.03 // Depressing
#1 1.43 // Intriguing
#2 0.20 // More intriguing
#3 0.29
#4 0.19 // Wow
#5 0.19
Now we only need to test AMD on Linux. I don’t have one at hand right now, unfortunately.
Try rebooting the machine with a live Linux like Knoppix or GRML.
It’s a very interesting topic. I investigated this question a lot and once found an amazing article on different spinlock and RW-lock implementations: http://locklessinc.com/articles/locks/ which I strongly recommend. It contains speed tests for different implementations and different numbers of threads.
Actually, the performance of different spinlocks depends heavily on the number of concurrent processes and the CPU cores used by the application.
After a series of experiments we came to the decision that a hybrid of spinlock and sleep (or semaphores) was kind of optimal for our application. We could control the number of attempts the spinlock makes to lock the resource, as well as the sleep interval. This was the optimal decision for our case, but it’s all very dependent on the application design and the desired CPU load.
I found locklessinc.com too, after doing these tests; they offer a nice pthread.h for Windows which may also work nicely as a porting layer:
http://locklessinc.com/articles/pthreads_on_windows/
Yes, on Windows it was always a problem. I used boost condition variables (quite a fast method among the cross-platform solutions) and asm-based spinlocks for synchronizing threads back in the days when I designed software for Windows. Now I deal mainly with Linux or Unix, but I still remember all the problems with timers and synchronization on Windows, and the cross-platform decisions. We also use the ACE library: it’s cross-platform and offers locks and RW-locks that often work faster than boost, although the documentation is poor.
Just as a side note, most compilers today don’t really optimize for std::memory_order_consume and simply treat it as std::memory_order_acquire. See also http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/
I ran the tests on an Intel Xeon E5-2630 running Windows 7 and got the following results:
VS 2012:
#0 Duration: 0.012001
#1 Duration: 205.906
#2 Duration: 1.25512
#3 Duration: 1.46515
#4 Duration: 1.63016
#5 Duration: 1.56016
#6 Duration: 1.79218
#7 Duration: 1.78818
#8 Duration: 1.77418
#9 Duration: 1.73417
#10 Duration: 1.64416
#11 Duration: 1.66317
#12 Duration: 1.78618
VS 2015:
#0 Duration: 0.00999999
#1 Duration: 2.52
#2 Duration: 1
#3 Duration: 1.17
#4 Duration: 0.03
#5 Duration: 0.00999999
#6 Duration: 0.00999999
#7 Duration: 0.02
#8 Duration: 0.03
#9 Duration: 0.00999999
#10 Duration: 0.00999999
#11 Duration: 0.02
#12 Duration: 0.02
Note that std::mutex is “fixed” and overall speed is better in 2015.
Conclusion: Upgrade to VS2015 NOW!
Thanks, this looks indeed promising!
What units are the table in? Microseconds?
AFAIR it was seconds.
The gist with the source probably shows it: https://gist.github.com/13abylon/523685ef6af5bde24dc3