Atomics in C++ — CAS and Memory Order

Ryonald Teofilo
10 min read · Oct 3, 2023

Continuing from the last post, where we discussed atomic read, write and read-modify-write operations, we have yet to cover arguably the most important and commonly used operation on a std::atomic: the compare-and-swap (CAS).

Compare-and-Swap

CAS would not be complete without the swap operation, which is available through the exchange() member function.

std::atomic<int> x(0);
int y = 42;
int x0 = x.exchange(y); // (x0 = x; x = y;) done atomically

Calling exchange() swaps the current value of the atomic for a desired value and returns the value it replaced, all done atomically.

CAS builds on the atomic swap operation by adding a condition before doing the swap. It is available through the compare_exchange_strong() and compare_exchange_weak() member functions.

std::atomic<int> x(0);
int expected = x.load();
int desired = 42;

// If x == expected, x = desired and return true
// Otherwise, expected = x and return false
while(!x.compare_exchange_strong(expected, desired));

As illustrated in the code above, the method takes an expected value that it compares with the atomic variable. If they are indeed equal, the atomic is swapped to the desired value and the call returns true.

Otherwise, the current value of the atomic is loaded into the expected variable (passed as a non-const reference), and the call returns false. This mechanism lets us create a retry loop without having to explicitly read the atomic again. This loop is what many lock-free algorithms revolve around.

A CAS fails because of contention with other threads. After the initial read of the atomic to get the expected value, we have to ensure that no other thread has changed the atomic since we last read it. When multiple threads attempt a CAS at the same time, the retry loop keeps trying until our CAS beats the others and gets our change in.
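To make the retry loop concrete, here is a small sketch of the pattern: a hypothetical FetchMax helper (not a standard function) that atomically updates an atomic to the maximum of its current value and a candidate, retrying whenever another thread wins the race.

```cpp
#include <atomic>

// Hypothetical helper: atomically set 'a' to the maximum of its current
// value and 'value', using the CAS retry loop described above.
int FetchMax(std::atomic<int>& a, int value)
{
    int expected = a.load();
    // If another thread changed 'a' since we read it, the failed CAS
    // refreshes 'expected' with the current value and we try again.
    while (expected < value &&
           !a.compare_exchange_strong(expected, value))
    {
    }
    return expected; // the value we observed before our update
}
```

Note that the loop condition re-checks `expected < value` after every failure, so the function also terminates early if another thread has already stored something larger.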

The following is an example of how CAS might be used in a lock-free list.

// List node
struct Node
{
    int mValue{0};
    Node* mNext{nullptr};
};

std::atomic<Node*> pHead(nullptr);

// Push a node to the front of the list
void PushFront(const int& value)
{
    Node* newNode = new Node();
    newNode->mValue = value;
    Node* expected = pHead.load();

    // Try to push the node we have just created to the front of the list.
    // On failure, 'expected' is refreshed with the current head, so we
    // simply re-link our node and retry.
    do
    {
        newNode->mNext = expected;
    }
    while(!pHead.compare_exchange_strong(expected, newNode));
}

Strong vs Weak

std::atomic provides two options for CAS, strong and weak.

compare_exchange_strong() is guaranteed to fail only when the current value of the atomic is no longer what we expected it to be, due to contention.

On the other hand, compare_exchange_weak() allows spurious failure. This means that even if the expected value is equal to the current value at the time of comparison, it can still return false.

But why would we want CAS to fail even if the current value is equal to the expected value? This is because on some platforms, acquiring exclusive access can be very expensive, while reading an atomic is often cheap (see cache coherency protocols; TL;DR: in most protocols, only one CPU is allowed to write to a cache line at a time, but many CPUs can read from a cache line simultaneously).

So instead of brute-forcing its way to exclusive access, a weak CAS executes a timed attempt to acquire the (hardware) lock, which may time out, similar to a network socket.

Pseudo-process between strong and weak CAS.

Weak CAS may yield better performance on some platforms, because it is often more efficient to let threads that are in a better position carry out their update first, instead of trying really hard to get exclusive access and wasting precious memory cycles. For instance, another core may have the cache line containing the atomic local to it.

So when should one use weak/strong compare-and-swap?

This depends on the cost of spurious failures in your code. If failures mean you would just have to update a few variables and try again, weak CAS might give you better performance. On the contrary, strong CAS may be better if failures mean you would have to destroy and reconstruct an expensive object. But again, when it comes to performance, always measure!
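Since the cost of a spurious failure inside a retry loop is just one extra cheap iteration, compare_exchange_weak is the usual choice there. A sketch, using a hypothetical FetchMultiply helper (an illustrative name, not from the article):

```cpp
#include <atomic>

// Hypothetical helper: atomically multiply 'a' by 'factor' using a
// weak-CAS retry loop. A spurious failure simply causes one more
// cheap iteration, so weak CAS fits naturally here.
int FetchMultiply(std::atomic<int>& a, int factor)
{
    int expected = a.load();
    while (!a.compare_exchange_weak(expected, expected * factor))
    {
        // 'expected' now holds the latest value; recompute and retry.
    }
    return expected; // the value we observed before our update
}
```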

It is important to note that there is no weak CAS on x86: all CAS operations on x86 brute-force acquire and hold exclusive access until the CAS is completed, so compare_exchange_weak behaves like the strong version there.

Memory Order (Memory Barriers)

In the lock-free list shown earlier, we see that the atomic variable is used as a gateway to give threads access to memory. This is how most lock-free algorithms use atomics. However, it is important to be aware that the memory it points to is not atomic!

Atomic used to reveal memory to other threads (seen in the lock-free list code earlier in the article).

Say we construct a new node on our thread and execute a CAS on the atomic pointer so it points at our new node. What guarantees that other threads will see the new node in a fully constructed state at that memory address? This is where memory barriers come in: they guarantee the visibility of non-atomic memory changes made by one CPU to another.

Memory barriers (expressed in C++ through std::memory_order) act as constraints, telling the compiler at compile time (see the as-if rule) and the hardware (CPU) at runtime not to reorder memory accesses in particular ways.

Short Prerequisite on Hardware Reordering

a = 1; // Writes to cache
b = 2; // Writes to cache
...
// This may happen in any order at some point in time
b -> memory
a -> memory

When writing to variables as in the example above, we are not writing directly to main memory (because that would be very slow!).

Instead, we write to a local cache, which then propagates to main memory at some later point in time. However, the hardware is allowed to reorder the writes to main memory, i.e. write b first, then a. This is known as hardware reordering.

In order to guarantee visibility of non-atomic changes across CPUs using a std::atomic (while still allowing optimisation opportunities for the compiler), C++ provides different levels of memory order, arranged from least strict to strictest as follows.

Note that memory_order_consume would not be discussed in this post, as its use is temporarily discouraged until the standard committee revises the spec.

std::memory_order_relaxed: relaxed means no memory barriers, i.e. no ordering guarantees. Reads and writes can be reordered by the compiler and hardware in any way; only the atomicity of the operation itself is guaranteed.

Relaxed — all reads and writes can be reordered in any way, no guarantees!

This obviously is not ideal for thread synchronisation, so relaxed ordering is typically used for counters where only atomicity matters, for instance incrementing the reference count in a smart pointer implementation such as std::shared_ptr.
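A minimal sketch of such a counter (the names gHits and CountHits are illustrative): each increment only needs to be atomic, with no ordering relative to surrounding memory operations, so relaxed is sufficient.

```cpp
#include <atomic>
#include <thread>

// Sketch of a relaxed event counter: we only need the increment itself
// to be atomic, not any ordering with surrounding reads and writes.
std::atomic<long> gHits{0};

void CountHits(int n)
{
    for (int i = 0; i < n; ++i)
        gHits.fetch_add(1, std::memory_order_relaxed);
}
```

Even with relaxed ordering, no increments are lost when multiple threads call CountHits concurrently; what is not guaranteed is any ordering between the counter and other memory.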

std::memory_order_acquire: the acquire barrier constrains memory operations scheduled after it. On the current thread, no read or write scheduled after the barrier can be reordered to before the barrier.

Acquire — all reads and writes can only be reordered from before the barrier, to after it.

std::memory_order_release: the release barrier is the opposite of acquire. On the current thread, no read or write scheduled before the barrier can be reordered to after the barrier, so everything before it becomes visible before the barrier.

Release — all reads and writes can only be reordered from after the barrier, to before it.

Acquire-Release order: acquire and release barriers are often used together to achieve acquire-release ordering.

If thread A stores to an atomic with a release barrier, and thread B loads the atomic with an acquire barrier, it is guaranteed that all writes made before the barrier in thread A will be visible after the barrier in thread B. In other words, thread A publishes its changes, and thread B acquires those changes. It is important to note that this guarantee only holds if threads A and B are referring to the same atomic!

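A minimal sketch of this publish/acquire handshake (the names payload, ready, Producer and Consumer are illustrative): the release store makes the non-atomic write visible to whoever performs the acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // non-atomic data being published
std::atomic<bool> ready{false};  // the atomic both threads refer to

void Producer()
{
    payload = 42;  // write before the release store...
    ready.store(true, std::memory_order_release);
}

void Consumer()
{
    // ...is guaranteed visible after the acquire load observes true
    while (!ready.load(std::memory_order_acquire)) {}
    assert(payload == 42);
}
```

Without the release/acquire pair (e.g. with relaxed ordering on both sides), the assert in Consumer would be allowed to fire on weakly ordered hardware.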
Acquire-release order is also used for locks, an example is this spinlock implementation using a std::atomic.

struct tas_lock
{
    std::atomic<bool> lock_ = {false};

    // Spins until we swap false -> true, acquiring the changes
    // published by the previous lock holder
    void lock() { while(lock_.exchange(true, std::memory_order_acquire)); }

    // Publishes our changes to the next lock holder
    void unlock() { lock_.store(false, std::memory_order_release); }
};

std::memory_order_acq_rel: the acquire-release barrier combines both acquire and release barriers, so no memory operation can move across the barrier in either direction. As with acquire-release ordering, this guarantee only holds if the threads refer to the same atomic.

An example use case is decrementing the reference count in a smart pointer implementation. This is because we want to ensure that changes made to the control block in our thread are released to other threads, and also acquire changes from other threads to safely destroy the control block, in case we are the last reference to the object.

smart_ptr<T>::~smart_ptr()
{
    // Release the changes we've made to the object; fetch_sub returns
    // the previous count, so 1 means we held the last reference
    if(pn->refCount.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        // Acquire all changes made in other threads to call the dtor safely
        ...
    }
}

std::memory_order_seq_cst: as the strictest memory order, sequential consistency provides the same guarantees as the acquire-release barriers, but without the requirement that threads refer to the same atomic.

Unlike any of the previously mentioned memory barriers, sequential consistency also guarantees a single total modification order across all atomic operations flagged as such. This means all threads are guaranteed to see the same order of these memory operations.

To understand this behaviour, this snippet from cppreference demonstrates it nicely. With std::memory_order_seq_cst we are guaranteed never to trigger the runtime assert, because all threads see the changes to the atomics in the same order. No other memory order guarantees this.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x = {false};
std::atomic<bool> y = {false};
std::atomic<int> z = {0};

void write_x()
{
    x.store(true, std::memory_order_seq_cst);
}

void write_y()
{
    y.store(true, std::memory_order_seq_cst);
}

void read_x_then_y()
{
    while (!x.load(std::memory_order_seq_cst)){}
    if (y.load(std::memory_order_seq_cst))
    {
        ++z;
    }
}

void read_y_then_x()
{
    while (!y.load(std::memory_order_seq_cst)){}
    if (x.load(std::memory_order_seq_cst))
    {
        ++z;
    }
}

int main()
{
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();

    // This assert will never trigger because threads c and d are
    // guaranteed to see the changes to x and y in the same order
    assert(z.load() != 0);
}

Even though total sequential ordering can be very expensive on some platforms (it requires affected memory accesses to propagate to every core), it is the default memory order for operations on std::atomic, as it is considered the least surprising and easiest to understand for most programmers, and rightfully so.

So, instead of avoiding std::memory_order_seq_cst like the plague, the key observation is that not all atomic operations need to be sequentially consistent.

Consider that lock-based programming is usually sequentially consistent, yet only needs std::memory_order_acquire and std::memory_order_release.

Cost of Memory Barriers

The cost of memory barriers varies depending on the platform. On strong memory model platforms like x86, all stores are release stores, all loads are acquire loads, and all read-modify-write operations are acquire-release, i.e. these barriers come for free. On weak memory model platforms like ARM, they can be really expensive.

It is worth noting that even if you are targeting x86, it is still important to specify the intended memory order, because it communicates your intention to other developers, and forbids the compiler from reordering memory accesses in ways that would cause unintended behaviour.
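For instance, spelling the order out even where x86 makes it free (Publish and Observe are illustrative names):

```cpp
#include <atomic>

std::atomic<int> flag{0};

// On x86 these compile to plain MOV instructions, yet writing the
// order explicitly documents intent and still constrains the
// compiler's reordering of surrounding code.
void Publish() { flag.store(1, std::memory_order_release); }
int  Observe() { return flag.load(std::memory_order_acquire); }
```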

When to use atomics over locks?

One of the main incentives for using atomics over locks is that they are often faster when used to synchronise concurrent access to expensive data structures.

Lock-free data structures often outperform the same data structures guarded with locks such as std::mutex. A good example is this MPMC queue. But always prove it by measuring!

Another advantage of using atomics is that we avoid the drawbacks locks can introduce, such as deadlocks that could be difficult to debug.

However, if it is not obvious already, lock-free code is hard to write and even harder to read. So if you ever decide to tap into C++'s magic stuff, make sure to specify your atomic operations and their memory order explicitly. This helps you reason about your code and convey your intentions to other programmers.

Hopefully that explains CAS and memory order in regard to std::atomic, and ensuring memory synchronisation using atomics when reading and writing on different threads.

Note that I have only touched the tip of the iceberg when it comes to the memory barriers and C++ memory model. I highly recommend watching this talk by Herb Sutter to really get into it.

Feel free to leave a comment if there are any questions or something you would like to add!
