Atomics in C++ — What is a std::atomic?

Ryonald Teofilo
7 min read · Sep 26, 2023

Often called the radioactive part of C++, std::atomic was added in C++11 as a way to declare an atomic variable: a variable whose value changes atomically, which means there is a guarantee that no other thread will ever see an intermediary state.

Since C++11 introduced std::thread as a standard language-provided way to spawn threads, it had to provide means to achieve well-defined behaviour when multiple threads read or write to the same object. Therefore, it added thread synchronisation primitives such as std::mutex and atomic variables in std::atomic. We will be looking at the latter today, where we will discuss the concept of atomicity and what data types could be made atomic in C++.

What is an atomic operation?

An atomic operation is an operation guaranteed to execute as a single unified transaction.

When an atomic operation is executed on an object by a specific thread, no other threads can read or modify the object while the atomic operation is in progress. This means that other threads will only see the object before or after the operation — no intermediary state.

Let’s take a look at some code to understand this.

#include <iostream>
#include <thread>

int main()
{
    int sum = 0;

    auto f = [&sum](){
        for(int i = 0; i < 1000000; i++)
        {
            sum += 1;
        }
    };

    std::thread t1(f);
    std::thread t2(f);

    t1.join();
    t2.join();

    std::cout << sum << std::endl;

    return 0;
}

In this code, we are incrementing sum on two threads without any synchronisation between them. One could naively assume that we would always output 2000000, because two threads are each incrementing it 1000000 times. If we compile and run this code, we find out that this is not true.

$ g++ --version
g++ (GCC) 11.4.0

$ g++ -dumpmachine
x86_64-pc-cygwin

$ g++ nonatomic.cpp -o app

$ ./app
1508328

$ ./app
1096773

$ ./app
1303703

Each execution prints a different output. This is obviously undesirable, but what exactly is happening here?

When sum is incremented by one thread, the value will need to be read from memory, modified (incremented) and written back into memory. In other words, it’s a read-modify-write operation.

Figure: Two threads updating sum non-atomically.

Without any synchronisation, both threads can read sum from memory as 0, increment it, and write 1 back to memory. Whichever thread writes first then gets its result stomped by the other thread.

This is known as a data race, which is undefined behaviour. Technically, anything could happen, including your program becoming sentient and colonising the human race. But in reality, as seen in our example, we may lose the result of some of the increments.

Atomic operations

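It is important to note that on x86, reads and writes of naturally aligned built-in types (up to the native word size) are atomic at the hardware level. This may not be the case on other platforms, and the C++ standard still treats unsynchronised access to a plain variable as a data race regardless of the hardware. So doing non-atomic reads and writes to the same variable from separate threads means you could really see anything in your output. Undefined behaviour really means undefined behaviour!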

However you may be wondering, if reads and writes on x86 are atomic, why is the output still inconsistent? This is because the increment (read-modify-write) operation as a whole is not atomic. This means that another thread could get a word in between the atomic read and write.

First, we can guarantee that each individual read and write is atomic by declaring sum as an atomic variable. Suppose we then split the increment into an atomic read followed by a separate atomic write, like so:

std::atomic<int> sum(0);

auto f = [&sum](){
    for(int i = 0; i < 1000000; i++)
    {
        sum = sum + 1;
    }
};

If we compile and run the code, as expected, we will still get inconsistent results.

$ g++ atomic.cpp -o app

$ ./app
1026549

$ ./app
1024982

At a lower level, the following may happen on an x86 machine:

  1. Core 1’s cache atomically fetches sum from memory (sum == 0). Core 1 increments it to 1.
  2. Core 2’s cache then atomically fetches sum from memory (sum == 0). Core 2 increments it to 1.
  3. Core 2 writes back 1 to its cache, which trickles down to memory (sum == 1).
  4. Core 1 writes back 1 to its cache, which will also propagate down to memory, stomping on the value written by core 2 (sum == 1).

Figure: Atomic read followed by an atomic write on each thread. This allows interleaving.

Again, we may lose the result of one of the increments.

In order to guarantee consistent results, we want to ensure the increment operation as a whole is atomic. We can update our code as such to achieve this.

auto f = [&sum](){
    for(int i = 0; i < 1000000; i++)
    {
        sum++; // Same as sum+=1;
    }
};

This works because std::atomic overloads operator+= and operator++, which increment the value atomically. The same result could be achieved with fetch_add(). By ensuring the read-modify-write operation is done atomically, we achieve the desired behaviour.

$ g++ atomic.cpp -o app

$ ./app
2000000

$ ./app
2000000

Returning to the lower level:

  1. Core 1 acquires exclusive access (hardware) to sum.
  2. Core 1’s cache fetches sum from memory (sum == 0).
  3. Core 1 then increments the value to 1 and writes it to its cache, which trickles down to the main memory, and releases the exclusive access.
  4. Core 2 will then do the same process and increment sum to 2.

Figure: Atomic read-modify-write on each thread. Guaranteed no increment lost.

There are other member functions that atomically read-modify-write, such as fetch_sub(). It is worth noting that these arithmetic operations are only available for integer types and raw pointers (and, since C++20, floating-point types get fetch_add() and fetch_sub() too). Other types are limited to atomic reads and writes, plus the exchange and compare-exchange operations.

Although the overloaded operators are convenient, I would still recommend using the more verbose member functions, as the operators are more error-prone and less clear as our expressions get more complex.

For instance, x.fetch_add(1) would immediately tell another programmer that the variable is an atomic and we are doing increments atomically. Additionally, x+=1, x++ and x=x+1 would all effectively mean the same thing for non-atomic variables, but not for atomic variables as we’ve discussed.

Similarly, using x.load() and x.store(1) immediately suggests an atomic type, unlike x or x = 1 which is less explicit.

std::atomic<int> y(0);
int x = y.load(); // Equivalent to int x = y
y.store(x);       // Equivalent to y = x

Atomic Types

Only trivially copyable types can be made atomic, i.e. types whose objects can be copied by copying their bits in memory, for example with memcpy().

This means that types with virtual functions, or types with user-defined copy operations because they manage memory stored elsewhere (std::string, for example), cannot be made atomic.

Although these types can be made atomic, they are not necessarily created equal. Some atomics are implemented lock-free and some are not. This is platform-dependent, mainly due to alignment requirements for atomic instructions on said platform.

When an atomic is not lock-free, it is usually implemented with some type of mutex or other locking operation. Regardless of that, the member functions of a std::atomic remain the same.

This property can be queried at runtime through is_lock_free(). Since C++17, we also have is_always_lock_free, a static constexpr member (a value, not a function) that is available at compile time. It is the stricter of the two: it is true only when every object of that atomic type is guaranteed to be lock-free, so even if is_always_lock_free is false, is_lock_free() may still return true at runtime.

Again, the reason it could only be determined at runtime is due to memory alignment. Let’s take a look at some code to understand this.

#include <iostream>
#include <atomic>
#include <cstdint>

struct A {uint32_t a;}; // 4 bytes
struct B {uint64_t a;}; // 8 bytes
struct C {uint32_t a; uint32_t b; uint32_t c;}; // 12 bytes
struct alignas(16) D {uint32_t a; uint32_t b; uint32_t c;}; // 16 bytes
struct E {uint64_t a; uint64_t b;}; // 16 bytes
struct F {uint64_t a; uint64_t b; uint64_t c;}; // 24 bytes

int main()
{
    std::atomic<A> a;
    std::atomic<B> b;
    std::atomic<C> c;
    std::atomic<D> d;
    std::atomic<E> e;
    std::atomic<F> f;

    std::cout << "Struct A: " << a.is_lock_free() << std::endl;
    std::cout << "Struct B: " << b.is_lock_free() << std::endl;
    std::cout << "Struct C: " << c.is_lock_free() << std::endl;
    std::cout << "Struct D: " << d.is_lock_free() << std::endl;
    std::cout << "Struct E: " << e.is_lock_free() << std::endl;
    std::cout << "Struct F: " << f.is_lock_free() << std::endl;

    return 0;
}

Compiled and run with GCC 6.4 on x86_64, this produces the following output.

Struct A: 1
Struct B: 1
Struct C: 0
Struct D: 1
Struct E: 1
Struct F: 0

It can be observed that alignment and padding matter! C (12 bytes) is not lock-free, while D, which holds the same three fields but is padded and aligned to 16 bytes, is.

It is also worth noting that I’ve compiled the source with an older version of GCC. This is done for demonstration purposes only, as GCC 7 and later no longer inline the double-word compare-and-swap (CAS) instruction on x86_64, for reasons explored here! TL;DR: later versions of GCC call libatomic instead of inlining the x86_64 16-byte CAS instruction, for reasons including it being slow.

Hopefully that explains what std::atomic is, why it is important for your multi-threaded program, and what data types could be made atomic.

I will be posting a second part to this article soon, where we will look into the CAS operation, memory barriers and atomics in lock-free algorithms (available here now!).

Feel free to leave a comment if there are any doubts, or something you would like to add!
