Understanding Memory Order

2022-09-07

Both C++11 and C11 introduced memory-order features in their new standards, driven by the spread of SMP and ever-larger caches. In a multi-core processor, the cores share main memory (setting NUMA aside here), but each core has its own cache. As a result, a modern CPU resembles a distributed system, and memory order serves as a crucial synchronization mechanism within that distributed system.

The new standards define the following memory orders (the corresponding enumerators are shown right after the list):

  • Acquire
  • Release
  • Acquire-Release (in short "acq-rel")
  • Sequentially-consistent (in short "seq-cst")
  • Relaxed
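
In C++11 these correspond to the std::memory_order enumerators (C11 exposes the same names, without the std:: prefix, in <stdatomic.h>). A small illustrative snippet, showing that the order is passed per atomic operation:

    #include <atomic>

    // C++11 enumerators for the orders listed above:
    //   std::memory_order_acquire
    //   std::memory_order_release
    //   std::memory_order_acq_rel
    //   std::memory_order_seq_cst
    //   std::memory_order_relaxed
    //   std::memory_order_consume   (see the note on "consume" below)

    std::atomic<int> counter{0};

    void demo() {
        counter.store(0, std::memory_order_release);     // the order is passed per operation
        int v = counter.load(std::memory_order_acquire);
        counter.fetch_add(v, std::memory_order_relaxed);
    }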

There's also a "consume" order, but mainstream compilers haven't implemented it; it's automatically treated as acquire. Thus, "consume" can be omitted.

Let's start with acquire and release. These operations take their names from acquiring and releasing mutex locks. To aid understanding, we can draw an analogy with Git: the computer's main memory is like a central Git repository, such as GitHub or GitLab, while each CPU core's cache is like a local clone kept in a different location.

When a thread on one CPU core reads from or writes to memory, it may actually be accessing that core's cache, and these loads and stores may not be synchronized with main memory right away. Meanwhile, a thread on another core may be accessing the same memory location, leading to inconsistencies, much like a Git conflict. With Git we can resolve conflicts by hand, but a CPU, whose state changes billions of times per second, has no such mechanism: a conflict is effectively resolved on a first-come, first-served basis, which produces race conditions and, in severe cases, crashes the process.

Acquire and release resemble pull and push. Acquire, like git pull, fetches the latest state from main memory, ensuring that the current cache is up to date before the operation proceeds. Release, like git push, synchronizes the modifications made in the cache back to main memory. Acq-rel, as its name suggests, is for atomic read-modify-write operations: it pulls before reading and pushes after writing. With such atomic operations, critical updates are guaranteed not to conflict, preventing race conditions.
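
As a rough sketch of this pull/push pairing (an illustration assumed here, not code from the post), here is a minimal spinlock: taking the lock is an acquire on a read-modify-write, and releasing the lock is a release store.

    #include <atomic>

    // Minimal spinlock sketch: lock() "pulls" with acquire, unlock() "pushes" with release.
    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // test_and_set is a read-modify-write; acquire makes everything the
            // previous lock holder published with release visible before we continue.
            while (flag.test_and_set(std::memory_order_acquire)) {
                // spin until the flag is cleared
            }
        }
        void unlock() {
            // release publishes the writes made inside the critical section
            // to whichever thread acquires the lock next.
            flag.clear(std::memory_order_release);
        }
    };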

On architectures like x86/x64, however, atomic loads and stores already come with acquire and release semantics at the hardware level; this is known as a strong memory model. ARM architectures, by contrast, have a weak memory model. Yet even on x86/x64 we must be careful, because the compiler may apply aggressive optimizations and reorder loads and stores. Acquire and release tell the compiler that reordering across those points is not allowed: because an acquire pulls in the latest state from main memory, reads and writes after it cannot be reordered before it; and because a release pushes earlier modifications out to main memory, reads and writes before it cannot be reordered after it.
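
To make these reordering rules concrete, here is the classic publication pattern as a sketch (the writer/reader split and the payload variable are illustrative): the payload write may not sink below the release store, and the payload read may not hoist above the acquire load, so the reader is guaranteed to see the payload once it sees the flag.

    #include <atomic>
    #include <cassert>

    int payload = 0;                      // plain, non-atomic data
    std::atomic<bool> ready{false};

    void writer() {
        payload = 42;                                  // must stay before the release store
        ready.store(true, std::memory_order_release);  // "push": publishes payload along with the flag
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire)) // "pull": later reads cannot move above this
            ;
        assert(payload == 42);                         // guaranteed once the acquire load sees true
    }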

A typical scenario for acquire-release is the reference count in a smart pointer. When a smart pointer goes out of scope, the reference count is decremented by one; if it reaches zero, the memory must be deallocated. Acquire-release is necessary here. First, because the object may be about to be freed, the thread doing the freeing must see the writes made on other CPU cores, so the acquire side is needed. Second, this thread's own final writes must be published so that other threads observe the decrement, so the release side is needed. At the same time, the compiler must be forbidden from reordering these reads and writes arbitrarily. Hence acquire-release is a must.
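
A minimal sketch of that decrement, with a hypothetical ref_count counter and destroy() helper standing in for a real smart pointer's control block:

    #include <atomic>

    std::atomic<int> ref_count{1};   // hypothetical counter in a shared control block

    void destroy() { /* hypothetical: delete the managed object here */ }

    void release_ref() {
        // acq_rel: the release half publishes this thread's last writes to the
        // object, and the acquire half makes every other thread's writes visible
        // to the thread that is about to free the memory.
        if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            destroy();
        }
    }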

Sequentially-consistent (seq-cst) is even stricter than acquire-release. It not only synchronizes with main memory but also ensures that every core observes the operation, so that all seq-cst operations fall into a single total order that every thread agrees on, achieving an effect similar to a single-core system, at the cost of speed. To avoid confusing developers, seq-cst is the default memory order for atomic operations in C++11. Release, in contrast, does not guarantee that all threads will see the current thread's modifications; other threads are guaranteed to see the modifications made before the release only when they perform an acquire operation.
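
The classic case where acquire/release alone is not enough is the store-buffering pattern, sketched below with hypothetical thread_a/thread_b functions: each thread stores to one flag and then loads the other.

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    void thread_a() {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    }

    void thread_b() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }

    // After running thread_a and thread_b concurrently, r1 == 0 && r2 == 0 is
    // impossible: all four seq-cst operations fall into one total order that
    // every thread agrees on. With release stores and acquire loads instead,
    // both threads could still read 0.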

Relaxed, on the other hand, involves no synchronization at all; it only guarantees that the read and write of the variable being operated on are themselves atomic. Relaxed is commonly used for incrementing reference counts: taking a new reference can never trigger destruction or deallocation, so nothing else needs to be synchronized, and the atomic increment alone is enough to prevent lost updates that would otherwise lead to a double free or a leak.
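
Continuing the hypothetical refcount sketch from above, the increment side only needs relaxed:

    #include <atomic>

    extern std::atomic<int> ref_count;   // the same hypothetical counter as above

    void add_ref() {
        // relaxed: nothing else needs to be synchronized when taking a new
        // reference; only the increment itself has to be atomic so that no
        // update is lost.
        ref_count.fetch_add(1, std::memory_order_relaxed);
    }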

Atomic operations and memory order, however, remain risky and should be used with extreme caution. If possible, it's preferable to use communication instead of sharing. If not, in most cases, mutexes and condition variables are still sufficient. Using atomic operations and memory order should be the last resort.

Email: i (at) mistivia (dot) com