C++ Atomic Types and Operations

ISO/IEC JTC1 SC22 WG21 N2427 = 07-0297 - 2007-10-03

Hans-J. Boehm, Hans.Boehm@hp.com, boehm@acm.org
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

This document is a revision of N2393 = 07-0253 - 2007-09-10.


Introduction
    Rationale
    Goals
Discussion of Design
    A Library API
    Atomic Types, Operations, and Volatile
    C and C++ Interoperability
    Memory Ordering
    Consistency
    Fences
    Dependent Memory Order
    Lock-Free Property
    Flag Type and Operations
    Integral and Address Types
    Atomic Operations
    Generic Types
    Implementability
Prior Approaches
Changes to the Standard
    Chapter 17 Library introduction [library]
    Chapter 20 General utilities library [utilities]
    20.1 Concepts [utility.concepts]
    20.1.2 Comparisons [concept.comparison]
    20.1.5 Copy and move [concept.copymove]
    Chapter 29 Atomics library [atomics]
    29.1 Order and Consistency [atomics.order]
    29.2 Lock-Free Property [atomics.lockfree]
    29.3 Flag Type and Operations [atomics.flag]
    29.4 Integral and Address Types [atomics.types]
    29.5 Operations [atomics.operations]
    29.6 Generic Types [atomics.generic]
    29.7 Synopsis [atomics.synopsis]
Implementation
    Notes on the Presentation
    Implementation Files
    C++0x Features
    Memory Order
    Flag Type and Operations
    Implementation Macros
    Lock-Free Macro
    Integral and Address Types
        Boolean
        Address
        Integers
        Integer Typedefs
        Characters
    Template Types
        Fully Generic Type
        Pointer Partial Specialization
        Integral Full Specializations
    C++ Core Functions
    C Core Macros
    C++ Methods
    Implementation Header Cleanup
    Standard Headers
Examples of Use
    Flag
    Lazy Initialization
    Integer
    Event Counter
    List Insert
    Update
    Main

Introduction

We propose standard atomic types and operations. In addition to normative wording for the interface, we present a minimal conforming implementation and example uses of the interface.

Rationale

The traditional shared-memory notion that every store is instantly visible to all threads induces an unacceptable performance loss on modern systems. Therefore programmers must have a mechanism to indicate when stores in one thread should be communicated to another.

Specifically, a program that wishes to communicate the fact that a particular piece of data prepared by one thread is ready to be examined by another thread, needs a shared variable flag, that both:

Although the second aspect is often glossed over, it is usually not automatic with modern hardware and compilers, and is just as important as the first in ensuring correctness.

In POSIX, this communication is achieved by calling certain functions, particularly mutex lock and unlock. While mutexes are adequate for many applications, there is a significant need for finer-grained atomic operations:

Lock-free concurrent data structures
Lock-free concurrent data structures are difficult to design and implement correctly. As such, there is a significant need for programmers capable of doing so to write portably, and for that they need standards.
Inter-thread memory synchronization
Occasionally synchronization with locks offers insufficient performance, and other, more specific idioms suffice.

Goals

We had several goals for our design. Each had significant impact on the details.

Discussion of Design

While our proposal is based on existing practice, that practice is somewhat fragmented, so some design choices may not be obvious. In this section, we discuss our choices and their reasoning.

A Library API

We propose to add atomic types and operations purely as a library API. In practice, for C++, this API would have to be implemented largely with either compiler intrinsics or assembly code. (As such, this proposal should be implemented by compiler vendors, not library vendors, much as the exception facilities are implemented by compiler vendors.) For C, a compiler implementation is required for the type-generic macros.

Atomic Types, Operations, and Volatile

We propose atomic types rather than atomic operations on general-purpose types. Doing so enforces a single synchronization protocol for all operations on an object. Using two protocols simultaneously will result in synchronization failure.

We chose to specify atomic types with conventional type names rather than modify the volatile type qualifier for concurrency. The volatile type qualifier has a long history within C and C++, and changing its meaning is both risky and unnecessary. In addition, the existing meaning of volatile, "may be modified by external agents", is somewhat orthogonal to "may be modified concurrently by the program". See N2016 Should volatile Acquire Atomicity and Thread Visibility Semantics? for a more complete discussion. As a consequence, objects of atomic type may also be volatile qualified. Compilers may optimize some non-volatile atomic operations, where they would not be able to optimize volatile operations.

C and C++ Interoperability

The proposal defines the atomic types as C++ standard-layout structs. In C, these would be simply opaque types. Headers common to both C and C++ use exactly the same syntax to declare atomic variables.

Furthermore, the proposal defines the C++ atomic types such that the static initialization syntax can be identical to C aggregate initialization syntax. That is, atomic variables can be initialized in common headers as well.

The core atomic operations are free functions that take pointers to atomic types. Programmers use the same syntax for these operations in both C and C++. That is, a header included from both C and C++ can provide inline functions that operate on atomic types.

The proposal defines the core functions as overloaded functions in C++ and as type-generic macros in C. This approach helps programmers avoid changing code when an atomic type changes size.

Because free functions are occasionally clumsy, the proposal additionally provides member operators and member functions so that C++ programmers may use a more concise syntax.

Memory Ordering

Synchronization operations in the memory model (N2052, N2138, N2171, N2176, N2177, N2300, N2392) may be either acquire or release operations, or both. These operations govern the communication of non-atomic stores between threads. A release operation ensures that prior memory operations become visible to a thread performing subsequent acquire operation on the same object.

Rather than have atomic operations implicitly provide both acquire and release semantics, we choose to complicate the interface by adding explicit ordering specifications to various operations. Many comparable packages do not, and instead provide only a single version of operations, like compare-and-swap, which implicitly include a full memory fence.

Unfortunately, the extra ordering constraint introduced by the single version is almost never completely necessary. For example, an atomic increment operation may be used simply to count the number of times a function is called, as in a profiler. This requires atomicity, but no ordering constraints. And on many architectures (PowerPC, ARM, Itanium, Alpha, though not X86), the extra ordering constraints are at least moderately expensive.

The proposal defines an enumeration that enables detailed specification of the memory order for every operation.

Consistency

A strict interpretation of acquire and release yields a fairly weak consistency model, which allows threads to have a different notion of the order of writes. For stronger consistency, this proposal destinguishes between an operation with acquire and release semantics and an operation with sequentially consistent semantics. See N2177, Sequential Consistency for Atomics.

The member operators provide no mechanism for specifying memory order enumeration, and so they have only the strongest synchronization. This default is not an undue burden because using weaker synchronization requires considerable thought. Any extra syntactic burden is dwarfed by the semantic burden.

Program auditors can search for uses of the memory order enumeration to identify code that uses syncronization models that are weaker than sequentially consistent.

Fences

It is also unclear how a convention requiring full global memory fences would properly interact with an interface that supported simple atomic loads and stores. Here a full memory fence would generally multiply the cost by a large factor. (The current gcc interface does not seem to support simple atomic loads and stores explicitly, which makes it unclear how to support e.g. lock-based emulation, or architectures on which the relevant loads and stores are not implicitly atomic.)

There are two possible approaches to specifying ordering constraints:

Both approaches appear to have their merits. We chose the latter for reasons described in N2176.

Some architectures provide fences that are limited to loads or stores. We have, so far, not included them, since it seems to be hard to find cases in which both:

However, we have provided per-variable fence operations, which are semantically modeled as read-modify-write operations. Such fences enable programmers to conditinalize memory synchronization, which can substantially improve performance. See N2153 A simple and efficient memory model for weakly-ordered architectures for motivating examples and the Lazy Initialization example for an use of the proposed syntax.

We expect that implementations that have hardware fences will use such operations to implement the language fences. See N2362 Converting Memory Fences to N2324 Form, for a discussion of the issues and approaches to converting from existing practice of global fences to the per-variable fences in this proposal.

Dependent Memory Order

Most architectures provide additional ordering guarantees if one memory operation is dependent on another. In fact, these are critical for efficient implementation of languages like Java.

For C++, there is near-universal agreement that it would be nice to have some such guarantees. The fundamental issues are:

See papers N2359 C++ Dependency Ordering: Atomics, N2360 C++ Dependency Ordering: Memory Model, and N2361 C++ Dependency Ordering: Function Annotation for exploration of these issues.

Lock-Free Property

In some cases, both the decision to use a lock-free algorithm, and sometimes the choice of lock-free algorithm, depends on the availability of underlying hardware primitives. In other cases, e.g. when dealing with asynchronous signals, it may be important to know that operations like compare-and-swap are really lock-free, because a lock-based emulation might result in deadlock.

The proposal defines feature queries to determine whether or not operations are lock-free. We provide two kinds of feature queries, compile-time preprocessor macros and run-time functions. To facilitate optimal storage use, the proposal supplies feature macros that indicates general lock-free status of integral and address atomic types. The run-time function provide per-object information.

The proposal provides run-time lock-free query functions rather than compile-time constants because subsequent implementations of a platform may upgrade locking operations with lock-free operations, so it is common for systems to abstract such facilities behind dynamic libraries, and we wish to leave that possiblility open. Furthermore, we recommend that implementations without hardware atomic support use that technique.

The proposal provides lock-free query functions on individual objects rather than whole types to permit unavoidably misaligned atomic variables without penalizing the performance of aligned atomic variables.

Because consistent use of operations requires that all operations on a type must use the same protocol, all operations are lock-free or none of them are. That is, the lock-free property applies to whole objects, not individual operations.

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state. While such a definition is beyond the scope of the standard, a clear statement of our intent will enable a portable expression of class of a programs already extant.

Flag Type and Operations

The proposal includes a very simple atomic flag type, providing two primary operations, test-and-set and clear. This type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with the atomic flag, though with less than ideal properties. Few programmers should be using this type.

We considered adding a wait operation in this proposal, but ultimately rejected it because pure busy waiting has pathological performance characteristics on multi-programmed machines and because doing better requires interaction with the operating system scheduler, which seems inappropriate to a processor-level facility.

Integral and Address Types

The proposal includes a full set of integral atomic types and an address atomic type.

This proposal includes atomic integral types smaller than a machine word, even though many architectures do not have such operations. For machines that implement a word-based compare-and-swap operation, the effect of operations can be achieved by loading the containing word, modifying the sub-word in place, and performing a compare-and-swap on the containing word. In the event that no compare-and-swap is available, the implementation may need to either implement smaller types with larger types or use locking algorithms. Using the same size for the atomic type as its base type eases effort required to port existing code, e.g. code using __sync.

Atomic Operations

The simplest atomic operations are load and store.

There are times when one wishes to store a new value, but also load the old value. An atomic load followed by an atomic store is not an atomic sequence, so we provide an atomic swap operation that does a combined load and store atomically.

We also provide the common fetch-and-modify operations. The fetch-and-modify functions return the original stored value. The original stored value is required for fetch-and-or and fetch-and-and because there is no means to compute to the original stored value from the new value and the modifying argument. In contrast to the functions, the fetch-and-modify assignment operators return the new value. We do this for consistency with normal assignment operators. Unlike normal C++ assignment operators, though, the atomic assignments return values rather than references, which is like C. The reason is that another thread might intervene between an assignment and a subsequent read. Rather than introduce this classic parallel programming bug, we return a value.

The normal signed integral addition operations have undefined behavior in the event of an overflow. For atomic variables, this undefined behavior introduces significant burden because there is in general no way to pre-test for overflow. Rather than leave overflow undefined, we recognize the defacto behavior of modern systems and define atomic fetch-and-add (-subtract) to use two's-complement arithmetic. We are aware of no implementation of C++ for which this definition is a problem.

The compare-and-swap operation seems to be the minimum general-purpose synchronization primitive. It appears to be both necessary and sufficient for most interesting lock-free algorithms. The compare-and-swap operations may fail spuriously. That is, the stored value may be known to be equal to the expected value, and yet the operation fails to swap. We permit this failure for two reasons. First, it appears unavoidable for load-locked store-conditional implementations. Second, hardware memory switches may prefer to fail some operations to ensure overall machine performance.

The compare-and-swap operations replace the expected value with the found value in the event that the found value does not match the expected value. The intent of the design is to set up a compare-and-swap loop for the next iteration. The reason for this design is that:

Unlike other operations, the compare-and-swap operations have two memory synchronization order parameters. The first is for operations that succeed; the second is for those that fail. The reason for this pair of specifications is that there are several circumstances in which the failure case can have substantially weaker, and substantially cheaper, synchronization. However, for sequential consistency, full synchronization is required.

Generic Types

Generic atomic types provide significant benefits:

The generic atomic type has a partial specialization for pointers (derived from the atomic address type), and full specializations for integral types (derived from the atomic integrals).

The primary problem with a generic atomic template is that effective use of machine operations requires that their parameter types are trivially copy assignable and bitwise equality comparable.

Furthermore, parameter types that are not also statically initializable and trivially destructable may be difficult to use.

In the present language, there is no mechanism to require these properties of a type parameter. Roland Schwarz suggests using a template union to enfore POD type parameters. Unfortunately, that approach also prevents the derivation of specializations of atomic for the types above, which is unacceptable. Furthermore, Lois Goldthwaite proposes generalizing unions to permit non-POD types in N2248 Toward a More Perfect Union. We believe that concepts are a more appropriate mechanism to enforce this restriction, and so we have proposed the concepts and concept maps necessary for safe use of generic atomics. The burden on programmers using generic atomics is to declare that types passes as template parameters are in fact bitwise equality comparable. The proposed mechanisms will infer trivially copy assignable types.

The intent is that vendors will specialize a fully-general locking implementation of a generic atomic template with implementations using hardware primitives when those primitives are applicable. This specialization can be accomplished by defining a base template with the size and alignment of the parameter type as additional template parameters, and then specializing on those two arguments.

Implementability

We believe that there is ample evidence for implementability.

The Intel/gcc __sync intrinsics provide evidence for compiler implementability of the proposed interface.

(We do not advocate standardizing these intrisics as is. They provide far less control over memory ordering than we advocated above. For example, they provide no way to atomically increment a counter without imposing unnecessary ordering constraints. The lack of appropriate ordering control appears to already have resulted in implementation shortcuts, e.g. gcc does not implement __sync_synchronize() as a full memory barrier on X86, in spite of the documentation. We believe a number of issues were not fully understood when that design was developed, and it could could greatly benefit from another iteration at this stage.)

Other packages, particularly Boehm's atomic_ops package provide evidence of efficient implementability over a range of architectures.

The remaining implementation issue is the burden on implementors to produce a minimally conforming implementation. The minimum hardware support required is the atomic test-and-set and clear operations that form the basis of the atomic flag type. This proposal includes an example implementation based on that minimal hardware support, and thus shows the vendor work required.

Prior Approaches

This proposal results from experience with many approaches. For brevity, we address only those approaches with formal proposals before the C++ standards committee.

N1875 presented an atomic statement, which would require the compiler to select the appropriate atomic operations. We chose to abandon this approach because:

N2047 presented a template-based C++ library of atomic types layered on top of a C library of atomic operations on plain integral types. We chose to abandon this approach because:

N2145 introduced the basic approach adopted in this proposal.

N2153 proposes a fence-based memory model.

N2195 presents a lower level interface than that of N2145. N2195 deviates from the previous proposals in the following major areas:

N2334 derives from N2145, but was refined in several areas.

N2381 is primarily a set of editorial changes to N2324, but with the following substantive changes.

N2393 is a substantial editorial rewrite of N2381, with primary emphasis on providing normative wording. Changes are generally not substantive, but with the following exceptions.

N2427 is a minor change to N2393 correcting a weakness in the interaction between concepts, booleans, and hardware compare-and-swap.