





- Multiple issue (OOO)
- Write buffer bypassing
- Nonblocking reads



## **Considerations**

- Define a partial order on memory requests, $A \rightarrow B$
  - If P<sub>i</sub> issues two requests A and B, and A is issued before B in program order, then $A \rightarrow B$
  - If A and B are issued to the same variable and A is entered first, then $A \rightarrow B$ (on all processors)
- These partial orders can be interleaved to define a total order
  - Many total orders are sequentially consistent!
- Example:
  - P1: W(a), R(b), W(c)
  - P2: R(a), W(a), R(b)
  - Are the following schedules (total orders) sequentially consistent?
    1. P1:W(a), P2:R(a), P2:W(a), P1:R(b), P2:R(b), P1:W(c)
    2. P1:W(a), P2:R(a), P1:R(b), P2:R(b), P1:W(c), P2:W(a)
    3. P2:R(a), P2:W(a), P1:R(b), P1:W(a), P1:W(c), P2:R(b)
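
The program-order half of the question can be checked mechanically. The sketch below is only an illustration: the names `Request`, `Programs` and `respects_program_order` are made up here, and because the example gives no values for the reads, the check covers only whether a schedule keeps each processor's requests in program order (it does not check that reads return the latest write).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One memory request: issuing processor plus operation, e.g. {"P1", "W(a)"}.
typedef std::pair<std::string, std::string> Request;
// Program of each processor, listed in program order.
typedef std::map<std::string, std::vector<std::string> > Programs;

// Returns true if `schedule` is an interleaving of `programs` that keeps
// every processor's requests in program order and uses each request once.
bool respects_program_order(const Programs& programs,
                            const std::vector<Request>& schedule) {
    std::map<std::string, std::size_t> next;  // next expected index per processor
    for (const Request& req : schedule) {
        Programs::const_iterator prog = programs.find(req.first);
        if (prog == programs.end()) return false;  // unknown processor
        std::size_t& pos = next[req.first];
        if (pos >= prog->second.size() || prog->second[pos] != req.second)
            return false;  // request appears out of program order
        ++pos;
    }
    // Every program must have been consumed completely.
    for (const Programs::value_type& entry : programs) {
        if (next[entry.first] != entry.second.size()) return false;
    }
    return true;
}
```

Under this criterion, schedule 1 passes, while schedule 2 moves P2's W(a) behind its R(b) and schedule 3 moves P1's R(b) ahead of its W(a), so both are rejected.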

## Write buffer example

#### Write buffer

- Absorbs writes faster than the next cache → prevents stalls
- Aggregates writes to the same cache block → reduces cache traffic
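
To make the two bullet points concrete, here is a toy software model of a write buffer. It is a sketch under simplifying assumptions, not a description of real hardware: the class name `WriteBuffer` and its methods are hypothetical, addresses are plain strings instead of cache blocks, and the shared `std::map` stands in for the next cache level.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy write buffer between one core and the next cache level (illustration only).
class WriteBuffer {
public:
    explicit WriteBuffer(std::map<std::string, int>& memory) : memory_(memory) {}

    // The store is absorbed by the buffer; a pending store to the same
    // address is overwritten in place (aggregation).
    void store(const std::string& addr, int value) {
        for (std::pair<std::string, int>& entry : pending_) {
            if (entry.first == addr) { entry.second = value; return; }
        }
        pending_.push_back(std::make_pair(addr, value));
    }

    // A load checks the core's own buffer first, then memory.
    // Stores still pending in OTHER cores' buffers are invisible to it.
    int load(const std::string& addr) const {
        for (const std::pair<std::string, int>& entry : pending_) {
            if (entry.first == addr) return entry.second;
        }
        std::map<std::string, int>::const_iterator it = memory_.find(addr);
        return it == memory_.end() ? 0 : it->second;
    }

    // Writes all pending stores to memory; returns the number of memory writes.
    std::size_t drain() {
        std::size_t writes = pending_.size();
        for (const std::pair<std::string, int>& entry : pending_) {
            memory_[entry.first] = entry.second;
        }
        pending_.clear();
        return writes;
    }

private:
    std::map<std::string, int>& memory_;
    std::vector<std::pair<std::string, int> > pending_;
};
```

With two buffers on one memory, a core that stores to `a` and then loads `b` can still read the old value of `b` while the other core's store to `b` sits in its buffer. This is the W→R relaxation that write buffers introduce.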





### **Relaxed Memory Models**

#### Sequential consistency

R→R, R→W, W→R, W→W (all orders guaranteed)

#### Relaxed consistency (varying terminology):

- Processor consistency (aka TSO)
  - Relaxes W→R
- Partial write order (aka PSO)
  - Relaxes W→R, W→W
- Weak consistency and release consistency (aka RMO)
  - Relaxes R→R, R→W, W→R, W→W
- Other combinations/variants possible

#### Architectures

Table: Memory ordering in some architectures<sup>[2][3]</sup>

- Columns (Type): Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA-64, zSeries
- Rows: loads reordered after loads; loads reordered after stores; stores reordered after stores; stores reordered after loads; atomic reordered with loads; atomic reordered with stores; dependent loads reordered; incoherent instruction cache pipeline

Some older x86 and AMD systems have weaker memory ordering<sup>[4]</sup>


## **Case Study: Memory ordering on Intel**

- Intel® 64 and IA-32 Architectures Software Developer's Manual
  - Volume 3A: System Programming Guide
  - Chapter 8.2 Memory Ordering
  - http://www.intel.com/products/processor/manuals/

#### Google Tech Talk: IA Memory Ordering

- Richard L. Hudson
  - http://www.youtube.com/watch?v=WUfvvFD5tAA

## x86 Memory model: TLO + CC

#### Total lock order (TLO)

Instructions with "lock" prefix enforce total order across all processors


Implicit locking: xchg (an exchange with a memory operand is always locked, even without the "lock" prefix)

#### Causal consistency (CC)

Write visibility is transitive

#### Eight principles


After some revisions

## The Eight x86 Principles

- 1. "Reads are not reordered with other reads." (R→R)
- 2. "Writes are not reordered with other writes." (W→W)
- 3. "Writes are not reordered with older reads." (R→W)
- 4. "Reads may be reordered with older writes to different locations but not with older writes to the same location." (NO W→R!)
- 5. "In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility)." (some more orders)
- 6. "In a multiprocessor system, writes to the same location have a total order." (implied by cache coherence)
- 7. "In a multiprocessor system, locked instructions have a total order." (enables synchronized programming!)
- 8. "Reads and writes are not reordered with locked instructions." (enables synchronized programming!)

## Principle 1 and 2

Reads are not reordered with other reads. ($R \rightarrow R$)

Writes are not reordered with other writes. ($W \rightarrow W$)



- If r1 == 2, then r2 must be 1!
- Not allowed: r1 == 1, r2 == 0
- Reads and writes observed in program order
- Cannot be reordered!
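
The effect of these two principles can be explored by enumerating interleavings. The litmus test in the sketch below is an assumption, since the slide's figure is not reproduced in the text: it is modeled on the classic message-passing pattern, with P1 executing `a = 1; b = 1` and P2 executing `r1 = b; r2 = a`, all values zero initially. The function name is hypothetical.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Assumed litmus test (not taken from the slide), all values zero initially:
//   P1: a = 1; b = 1;        P2: r1 = b; r2 = a;
// Enumerates every interleaving that keeps both program orders and
// returns the set of observable (r1, r2) outcomes.
std::set<std::pair<int, int> > message_passing_outcomes() {
    std::set<std::pair<int, int> > outcomes;
    // Bit i of the mask says whether global step i is executed by P1 (1) or P2 (0).
    for (int mask = 0; mask < 16; ++mask) {
        int p1_steps = 0;
        for (int i = 0; i < 4; ++i) p1_steps += (mask >> i) & 1;
        if (p1_steps != 2) continue;  // each processor executes exactly two steps

        int a = 0, b = 0, r1 = 0, r2 = 0;
        int pc1 = 0, pc2 = 0;  // per-processor program counters
        for (int i = 0; i < 4; ++i) {
            if ((mask >> i) & 1) {  // P1 executes its next instruction
                if (pc1 == 0) a = 1; else b = 1;
                ++pc1;
            } else {                // P2 executes its next instruction
                if (pc2 == 0) r1 = b; else r2 = a;
                ++pc2;
            }
        }
        outcomes.insert(std::make_pair(r1, r2));
    }
    return outcomes;
}
```

When neither the two writes nor the two reads are reordered, the outcome r1 == 1, r2 == 0 never appears in the returned set: once P2 has seen the write to `b`, the earlier write to `a` must be visible as well.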

## Principle 3

Writes are not reordered with older reads. ( $R \rightarrow W$ )







## **Principle 6**

In a multiprocessor system, writes to the same location have a total order. (implied by cache coherence)

All values zero initially

| P1  | P2  | P3               | P4               |
|-----|-----|------------------|------------------|
| a=1 | a=2 | r1 = a<br>r2 = a | r3 = a<br>r4 = a |

- Not allowed: r1 == 1, r2 == 2, r3 == 2, r4 == 1
- If P3 observes P1's write before P2's write, then P4 will also see P1's write before P2's write
- Provides some form of atomicity

## **Principle 7**

In a multiprocessor system, locked instructions have a total order. (enables synchronized programming!)

All values zero initially, registers r1==r2==1



- Not allowed: r3 == 1, r4 == 0, r5 == 1, r6 == 0
- If P3 observes ordering P1:xchg → P2:xchg, P4 observes the same ordering
- (xchg has implicit lock)

## **Principle 8**

Reads and writes are not reordered with locked instructions. (enables synchronized programming!)

| P1                   | P2                   |
|----------------------|----------------------|
| xchg(a,r1)<br>r2 = b | xchg(b,r3)<br>r4 = a |

- Locked instructions have total order, so P1 and P2 agree on the same order
- If volatile variables use locked instructions → practical sequential consistency
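
The difference between a plain write and a locked exchange in this example can be sketched with a small enumeration. Several things here are assumptions rather than content from the slide: all values start at zero, r1 == r3 == 1, the outcome of interest is r2 == 0 together with r4 == 0, a plain write is modeled as a buffered store that becomes visible at some later flush point, and `xchg` is modeled as a store whose flush follows immediately. The names `Event` and `principle8_outcomes` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Events of the two processors. A store enters the write buffer at STOREx
// and becomes globally visible at FLUSHx.
enum Event { STORE1, LOAD1, FLUSH1, STORE2, LOAD2, FLUSH2 };

// Returns all (r2, r4) outcomes of
//   P1: a = 1; r2 = b;      P2: b = 1; r4 = a;
// locked == false: plain stores, flushed at any later point (write buffer)
// locked == true:  stores behave like xchg and are flushed immediately
std::set<std::pair<int, int> > principle8_outcomes(bool locked) {
    std::vector<int> order = {STORE1, LOAD1, FLUSH1, STORE2, LOAD2, FLUSH2};
    std::sort(order.begin(), order.end());
    std::set<std::pair<int, int> > outcomes;
    do {
        int pos[6];
        for (int i = 0; i < 6; ++i) pos[order[i]] = i;
        // Program order: the store is issued before the load and before its flush.
        bool valid = pos[STORE1] < pos[LOAD1] && pos[STORE1] < pos[FLUSH1] &&
                     pos[STORE2] < pos[LOAD2] && pos[STORE2] < pos[FLUSH2];
        if (locked) {
            // A locked instruction is visible before anything else happens.
            valid = valid && pos[FLUSH1] == pos[STORE1] + 1 &&
                    pos[FLUSH2] == pos[STORE2] + 1;
        }
        if (!valid) continue;

        int a = 0, b = 0, r2 = 0, r4 = 0;
        for (int i = 0; i < 6; ++i) {
            switch (order[i]) {
                case FLUSH1: a = 1; break;   // P1's write reaches memory
                case FLUSH2: b = 1; break;   // P2's write reaches memory
                case LOAD1:  r2 = b; break;
                case LOAD2:  r4 = a; break;
                default: break;              // buffered stores change nothing yet
            }
        }
        outcomes.insert(std::make_pair(r2, r4));
    } while (std::next_permutation(order.begin(), order.end()));
    return outcomes;
}
```

In this model, `principle8_outcomes(false)` contains (0, 0) because both loads may run while both stores are still buffered, whereas `principle8_outcomes(true)` does not: each load is ordered after its own processor's locked instruction, so at least one processor sees the other's write.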

## An Alternative View: x86-TSO

Sewell et al.: "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors", CACM July 2010

"[...] real multiprocessors typically do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, varying in subtle ways between processor families, in which different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architectures, supposedly specifying what programmers can rely on, are often in ambiguous informal prose (a particularly poor medium for loose specifications), leading to widespread confusion. [...] We present a new x86-TSO programmer's model that, to the best of our knowledge, suffers from none of these problems. It is mathematically precise (rigorously defined in HOL4) but can be presented as an intuitive abstract machine which should be widely accessible to working programmers. [...]"

## **Notions of Correctness**

We discussed so far:


- Read/write of the same location
  - Cache coherence (write serialization and atomicity)
- Read/write of multiple locations
  - Memory models (visibility order of updates by cores)

#### Now: objects (variables/fields with invariants defined on them)

- Invariants "tie" variables together
- Sequential objects
- Concurrent objects






## **Sequential Queue**





## Design by Contract!

#### Preconditions:

- Specify conditions that must hold before method executes
- Involve state and arguments passed
- Specify obligations a client must meet before calling a method

#### Example: enq()

Queue must not be full!

    class Queue {
      void enq(Item x) {
        assert(tail-head < items.size()-1);
        ....
      }
    };

Figure: circular buffer (capacity = 8) with head and tail indices.



## **Sequential specification**

#### if(precondition)

Object is in a specified state

#### then(postcondition)

- The method returns a particular value or
- Throws a particular exception and
- Leaves the object in a specified state

#### Invariants


Specified conditions (e.g., object state) must hold any time a client could invoke an object's method!
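
The pieces of a sequential specification can be written down as assertions. The following is a minimal sketch, not the lecture's implementation: it assumes `int` items, uses hypothetical member names, and lets `head_` and `tail_` count operations (reduced modulo the capacity only when indexing), which differs from the full-queue test shown in the `enq()` example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bounded FIFO queue with explicit contracts (illustrative sketch).
class Queue {
public:
    explicit Queue(std::size_t capacity) : items_(capacity), head_(0), tail_(0) {}

    bool empty() const { return head_ == tail_; }
    bool full() const { return tail_ - head_ == items_.size(); }
    std::size_t size() const { return tail_ - head_; }

    void enq(int x) {
        assert(!full());                         // precondition: queue must not be full
        const std::size_t old_size = size();
        items_[tail_ % items_.size()] = x;
        ++tail_;
        assert(size() == old_size + 1);          // postcondition: one more item stored
        assert(invariant());                     // invariant holds when the method returns
    }

    int deq() {
        assert(!empty());                        // precondition: queue must not be empty
        const std::size_t old_size = size();
        const int item = items_[head_ % items_.size()];
        ++head_;
        assert(size() + 1 == old_size);          // postcondition: one item removed
        assert(invariant());
        return item;
    }

private:
    // Must hold whenever a client could invoke a method.
    bool invariant() const {
        return head_ <= tail_ && tail_ - head_ <= items_.size();
    }

    std::vector<int> items_;
    std::size_t head_;  // number of items dequeued so far
    std::size_t tail_;  // number of items enqueued so far
};
```

Because the invariant is re-established at the end of every method, each method can be reasoned about in isolation, as long as calls do not overlap in time.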


## Advantages of sequential specification

#### State between method calls is defined

- Enables reasoning about objects
- Interactions between methods captured by side effects on object state

#### Enables reasoning about each method in isolation

- Contracts for each method
- Local state changes global state

#### Adding new methods

- Only reason about state changes that the new method causes
- If invariants are kept: no need to check old methods

### **Concurrent execution - State**

Concurrent threads invoke methods on possibly shared objects

At overlapping time intervals!





### **Concurrent execution - Method addition**

Reasoning must now include all possible interleavings of the changes caused by the methods themselves.

| Property   | Sequential                                                                   | Concurrent                                               |
|------------|------------------------------------------------------------------------------|----------------------------------------------------------|
| Add method | Without affecting other methods; invariants on state before/after execution. | Everything can potentially interact with everything else |

Consider adding a method that returns the last item enqueued:

    Item peek() {
      return items[(tail-1) % items.size()];
    }

    void enq(Item x) {
      items[tail % items.size()] = x;
      tail = (tail+1) % items.size();
    }

    Item deq() {
      Item item = items[head % items.size()];
      head = (head+1) % items.size();
      return item;
    }

