### Design of Parallel and High-Performance Computing

Fall 2015

Lecture: Cache Coherence & Memory Models

Motivational video: https://www.youtube.com/watch?v=zJybFF6PqEQ

**Instructors:** Torsten Hoefler & Markus Püschel **TAs:** Timo Schneider

ETH Zürich: Eidgenössische Technische Hochschule Zürich / Swiss Federal Institute of Technology Zurich





### Peer Quiz - Critical Thinking

#### Instructions:

- Pick some partners (locally) and discuss each question for 1 minute
  - We then select a random student (team) to answer the question
- What is the TOP500 list? Discuss its usefulness (pro/con)!
  - What should we change?
- What is the main limitation in single-core scaling today?
  - i.e., why do cores not become much faster?
  - What will be the next big problem/limit?
- What is the difference between UMA and NUMA architectures?
  - Discuss which architecture is more scalable!
- Describe the difference between shared memory, partitioned global address space, and distributed memory programming
  - Name at least one practical example programming system for each
  - Why do all of these models co-exist?

### **Goals of this lecture**

- Memory Trends
- Cache Coherence
- Memory Consistency

### **Issues (AMD Interlagos as Example)**

#### How to measure bandwidth?

- Data sheet (often peak performance, may include overheads)
  - Frequency times bus width: 51 GiB/s
- Microbenchmark performance
  - Stride 1 access (32 MiB): 32 GiB/s
  - Random access (8 B out of 32 MiB): 241 MiB/s
  - Why?
- Application performance
  - As observed (performance counters)
  - Somewhere in between stride 1 and random access
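One way to approach the "Why?" is a back-of-the-envelope calculation. The sketch below compares the measured random-access bandwidth with two simple bounds. The 64 B cache-line size is an assumption (it is not stated on the slide), and the 110 ns figure is the single-core pointer-chase latency quoted under the latency measurements of this section. This is a rough model, not a statement about how the benchmark was actually run.

```python
GIB = 2**30
MIB = 2**20

stride1_bw = 32 * GIB    # bytes/s, measured stride-1 bandwidth (from the slide)
random_bw = 241 * MIB    # bytes/s, measured random-access bandwidth (from the slide)
access_size = 8          # bytes used per random access (from the slide)
line_size = 64           # bytes per cache line (assumption, not from the slide)
latency_ns = 110         # single-core pointer-chase latency (from the slide)

# Bound 1: every access transfers a whole cache line but uses only 8 B of it.
line_waste_bound = stride1_bw * access_size / line_size

# Bound 2: one access at a time, each paying the full memory latency.
latency_bound = access_size / (latency_ns * 1e-9)

# What the measurement implies: time per access and number of overlapping misses.
ns_per_access = access_size / random_bw * 1e9
implied_overlap = latency_ns / ns_per_access

print(f"line-waste bound : {line_waste_bound / MIB:8.1f} MiB/s")
print(f"latency bound    : {latency_bound / MIB:8.1f} MiB/s")
print(f"measured         : {random_bw / MIB:8.1f} MiB/s")
print(f"implied overlap  : {implied_overlap:.2f} misses in flight")
```

Under these assumptions, wasted cache-line capacity alone would still allow about 4 GiB/s, far above the measurement. A strictly serialized chain of accesses would reach only about 69 MiB/s. The measured 241 MiB/s sits between the two, which suggests that random access is limited by latency with only a few misses (roughly 3.5) overlapping.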

#### How to measure Latency?

- Data sheet (often optimistic, or not provided)
  - <100 ns
- Random pointer chase
  - 110 ns with one core, 258 ns with 32 cores!
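A random pointer chase walks a chain in which every load depends on the result of the previous one, so prefetching and overlapping cannot hide the latency. The sketch below shows one way to build such a chain, using Sattolo's algorithm so that the chain is a single cycle through all slots. The names `make_chain` and `chase` and the chosen size are illustrative assumptions. In Python the timing is dominated by interpreter overhead, so this only shows the structure; a real measurement would use C (or similar) with an array larger than the last-level cache.

```python
import random
import time

def make_chain(n, seed=0):
    """Return a list nxt that forms one cycle through all n slots (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)            # j < i, never i itself: yields a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    """Follow the chain; each load depends on the result of the previous one."""
    pos = start
    for _ in range(steps):
        pos = nxt[pos]
    return pos

n = 1 << 16                             # illustrative size; a real benchmark exceeds the cache
chain = make_chain(n)
t0 = time.perf_counter()
end = chase(chain, n)                   # n steps around a cycle of length n
elapsed = time.perf_counter() - t0
print(f"{elapsed / n * 1e9:.1f} ns per step (interpreter overhead included)")
```

Because the chain is a single cycle, `n` steps visit every slot exactly once and return to the start, so no element is served from cache merely because it was revisited early.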



### **Conjecture: Buffering is a must!**

Two most common examples:

- Write buffers
  - Delayed write back saves memory bandwidth
  - Data is often overwritten or re-read
- Caching
  - Directory of recently used locations
  - Stored as blocks (cache lines)

### **Cache Coherence**

- Different caches may have a copy of the same memory location!
- Cache coherence
  - Manages existence of multiple copies
- Cache architectures
  - Multi-level caches
  - Multi-port vs. single port
  - Shared vs. private (partitioned)
  - Inclusive vs. exclusive
  - Write back vs. write through
  - Victim cache to reduce conflict misses

[Figures: **Exclusive Hierarchical Caches** and **Shared Hierarchical Caches**, each showing CPUs attached to L2 caches]



### **Cache Coherence Approaches**

- Based on invalidation
  - Broadcast all coherency traffic (writes to shared lines) to all caches
  - Each cache snoops
    - Invalidate lines written by other CPUs
    - Signal sharing for cache lines in local cache to other caches
  - Simple implementation for bus-based systems
  - Works at small scale, challenging at large scale
    - E.g., Intel Broadwell
- Based on explicit updates
  - Central directory for cache line ownership
  - Local write updates copies in remote caches
    - Can update all CPUs at once
    - Multiple writes cause multiple updates (more traffic)
  - Scalable but more complex/expensive
    - E.g., Intel Xeon Phi KNC

### **Invalidation vs. update**

- Invalidation-based:
  - Only write misses hit the bus (works with write-back caches)
  - Subsequent writes to the same cache line are local
  - Good for multiple writes to the same line (in the same cache)
- Update-based:
  - All sharers continue to hit cache line after one core writes
  - Implicit assumption: shared lines are accessed often
  - Supports producer-consumer pattern well
  - Many (local) writes may waste bandwidth!
- Hybrid forms are possible!

### **MESI Cache Coherence**

- Most common hardware implementation of discussed requirements
  - Also known as the "Illinois protocol"
- Each line has one of the following states (in a cache):
  - Modified (M)
    - Local copy has been modified, no copies in other caches
    - Memory is stale
  - Exclusive (E)
    - No copies in other caches
    - Memory is up to date
  - Shared (S)
    - Unmodified copies may exist in other caches
    - Memory is up to date
  - Invalid (I)
    - Line is not in cache

### **Terminology**

- Clean line:
  - Content of cache line and main memory is identical (also: memory is up to date)
  - Can be evicted without write-back
- Dirty line:
  - Content of cache line and main memory differ (also: memory is stale)
  - Needs to be written back eventually
    - Time depends on protocol details
- Bus transaction:
  - A signal on the bus that can be observed by all caches
  - Usually blocking
- Local read/write:
  - A load/store operation originating at a core connected to the cache

### **Transitions in response to local reads**

- State is M
  - No bus transaction
- State is E
  - No bus transaction
- State is S
  - No bus transaction
- State is I
  - Generate bus read request (BusRd)
    - May force other cache operations (see later)
  - Other cache(s) signal "sharing" if they hold a copy
  - If shared was signaled, go to state S
  - Otherwise, go to state E
  - After update: return read value

### **Transitions in response to local writes**

- State is M
  - No bus transaction
- State is E
  - No bus transaction
  - Go to state M
- State is S
  - Line already local & clean
  - There may be other copies
  - Generate bus read request for upgrade to exclusive (BusRdX\*)
  - Go to state M
- State is I
  - Generate bus read request for exclusive ownership (BusRdX)
  - Go to state M

## Transitions in response to snooped BusRd

- State is M
  - Write cache line back to main memory
  - Signal "shared"
  - Go to state S
- State is E
  - Signal "shared"
  - Go to state S
- State is S
  - Signal "shared"
- State is I
  - Ignore

### Transitions in response to snooped BusRdX

- State is M
  - Write cache line back to memory
  - Discard line and go to I
- State is E
  - Discard line and go to I
- State is S
  - Discard line and go to I
- State is I
  - Ignore
- BusRdX\* is handled like BusRdX!



## **Small Exercise**

Initially: all in I state

| Action      | P1 state | P2 state | P3 state | Bus action | Data from |
|-------------|----------|----------|----------|------------|-----------|
| P1 reads x  |          |          |          |            |           |
| P2 reads x  |          |          |          |            |           |
| P1 writes x |          |          |          |            |           |
| P1 reads x  |          |          |          |            |           |
| P3 writes x |          |          |          |            |           |

## **Optimizations?**

Class question: what could be optimized in the MESI protocol to make a system faster?

## **Small Exercise**

Initially: all in I state

| Action      | P1 state | P2 state | P3 state | Bus action | Data from |
|-------------|----------|----------|----------|------------|-----------|
| P1 reads x  | E        | I        | I        | BusRd      | Memory    |
| P2 reads x  | S        | S        | I        | BusRd      | Memory    |
| P1 writes x | M        | I        | I        | BusRdX*    | Cache     |
| P1 reads x  | M        | I        | I        | -          | Cache     |
| P3 writes x | I        | I        | M        | BusRdX     | Memory    |
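The table can be replayed with a small simulator of the MESI transitions for a single cache line. This is a minimal sketch, assuming a blocking bus and the plain MESI variant described in this lecture, where memory supplies the data on a miss. The class `MESIBus` and its method names are illustrative, not part of any real library, and the "Data from" column follows the convention of the table (a hit or an upgrade counts as "Cache").

```python
class MESIBus:
    """Bus-based MESI for one cache line shared by n_caches private caches."""

    def __init__(self, n_caches):
        self.state = ["I"] * n_caches

    def _snoop(self, requester, bus_op):
        """Let all other caches react; return True if any of them signaled 'shared'."""
        shared = False
        for c, s in enumerate(self.state):
            if c == requester or s == "I":
                continue
            # A cache in M would first write the line back to memory (not modeled as data).
            if bus_op == "BusRd":
                shared = True
                self.state[c] = "S"
            else:                        # BusRdX and BusRdX* invalidate remote copies
                self.state[c] = "I"
        return shared

    def read(self, c):
        if self.state[c] != "I":         # M, E, S: local hit, no bus transaction
            return "-", "Cache"
        shared = self._snoop(c, "BusRd")
        self.state[c] = "S" if shared else "E"
        return "BusRd", "Memory"

    def write(self, c):
        s = self.state[c]
        if s in ("M", "E"):              # silent upgrade, no bus transaction
            bus, src = "-", "Cache"
        elif s == "S":                   # line is local and clean, only ownership is needed
            self._snoop(c, "BusRdX*")
            bus, src = "BusRdX*", "Cache"
        else:                            # miss: fetch the line with exclusive ownership
            self._snoop(c, "BusRdX")
            bus, src = "BusRdX", "Memory"
        self.state[c] = "M"
        return bus, src


def replay(ops, n_caches=3):
    """ops: list of (cache index, 'r' or 'w'); returns one (states, bus, source) row per op."""
    bus = MESIBus(n_caches)
    rows = []
    for c, op in ops:
        action, src = bus.read(c) if op == "r" else bus.write(c)
        rows.append(("".join(bus.state), action, src))
    return rows


exercise = [(0, "r"), (1, "r"), (0, "w"), (0, "r"), (2, "w")]
for row in replay(exercise):
    print(row)
```

Each printed row holds the states of P1, P2, and P3 as one string, followed by the bus action and the data source, so it can be compared line by line with the table.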

## Related Protocols: MOESI (AMD)

#### Extended MESI protocol

- Cache-to-cache transfer of modified cache lines
  - Cache in M or O state always transfers cache line to requesting cache
  - No need to contact (slow) main memory
- Avoids write back when another process accesses cache line
  - Good when cache-to-cache performance is higher than cache-to-memory
  - E.g., shared last level cache!
- Broadcasts updates in O state
  - Additional load on the bus

### **MOESI State Diagram**



### **Related Protocols: MOESI (AMD)**

- Modified (M): Modified Exclusive
  - No copies in other caches, local copy dirty
  - Memory is stale, cache supplies copy (reply to BusRd\*)
- Owner (O): Modified Shared
  - Exclusive right to make changes
  - Other S copies may exist ("dirty sharing")
  - Memory is stale, cache supplies copy (reply to BusRd\*)
- Exclusive (E):
  - Same as MESI (one local copy, up to date memory)
- Shared (S):
  - Unmodified copy may exist in other caches
  - Memory is up to date unless an O copy exists in another cache
- Invalid (I):
  - Same as MESI
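The effect of the O state can be illustrated by comparing how one cache reacts to a sequence of snooped BusRd requests under MESI and under MOESI. The sketch below encodes only the BusRd reactions listed in this lecture; the table layout and the helper `serve_remote_reads` are illustrative assumptions, and the eventual write-back when an O line is evicted is not modeled.

```python
# Reaction to a snooped BusRd: state -> (new state, supplies data, writes back to memory)
MESI_ON_BUSRD = {
    "M": ("S", False, True),    # write back, then memory serves the reader
    "E": ("S", False, False),
    "S": ("S", False, False),
    "I": ("I", False, False),
}
MOESI_ON_BUSRD = {
    "M": ("O", True, False),    # keep the dirty line, supply it cache-to-cache
    "O": ("O", True, False),
    "E": ("S", False, False),
    "S": ("S", False, False),
    "I": ("I", False, False),
}

def serve_remote_reads(table, state, n_reads):
    """Apply n_reads snooped BusRd requests; count write-backs and cache-to-cache transfers."""
    writebacks = transfers = 0
    for _ in range(n_reads):
        state, supplies, writes_back = table[state]
        transfers += supplies
        writebacks += writes_back
    return state, writebacks, transfers

print("MESI :", serve_remote_reads(MESI_ON_BUSRD, "M", 3))
print("MOESI:", serve_remote_reads(MOESI_ON_BUSRD, "M", 3))
```

Starting from M, MESI pays one memory write-back on the first remote read, while MOESI avoids it and instead answers every remote read from the owning cache. This only pays off when cache-to-cache transfers are cheaper than memory accesses.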

## Related Protocols: MESIF (Intel)

- Modified (M): Modified Exclusive
  - No copies in other caches, local copy dirty
  - Memory is stale, cache supplies copy (reply to BusRd\*)
- Exclusive (E):
  - Same as MESI (one local copy, up to date memory)
- Shared (S):
  - Unmodified copy may exist in other caches
  - Memory is up to date (F, like S, is a clean state)
- Invalid (I):
  - Same as MESI
- Forward (F):
  - Special form of S state, other caches may have line in S
  - Most recent requester of line is in F state
  - Cache acts as responder for requests to this line
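How the F state moves between caches can be sketched for clean lines. The function below handles only the S, F, and I states and assumes that memory responds when no cache holds the line in F (for example, after the F copy was evicted). The name `mesif_read` is illustrative, and the handling of M and E copies is deliberately left out.

```python
def mesif_read(states, requester):
    """Serve a read miss on a clean line.

    states: per-cache state of one line, each 'F', 'S', or 'I' (modified in place).
    Returns the index of the responding cache, or None if memory responds.
    """
    if any(s not in ("F", "S", "I") for s in states):
        raise ValueError("sketch covers clean shared lines only (F, S, I)")
    if states[requester] != "I":
        return requester                 # local hit, nothing to forward
    responder = None
    for c, s in enumerate(states):
        if s == "F":
            responder = c                # the single F copy answers the request
            states[c] = "S"              # ...and hands the forwarder role on
    states[requester] = "F"              # most recent requester becomes the forwarder
    return responder

line = ["F", "S", "I", "I"]
print(mesif_read(line, 2), line)
print(mesif_read(line, 3), line)
```

Only one cache answers each request even when many caches hold the line in S, which avoids redundant replies from all sharers.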

### **Multi-level caches**

#### Most systems have multi-level caches

- Problem: only "last level cache" is connected to bus or network
- Snoop requests are relevant for inner levels of cache (L1)
- Modifications of L1 data may not be visible at L2 (and thus the bus)

#### L1/L2 modifications

- On BusRd check if line is in M state in L1
  - It may be in E or S in L2!
- On BusRdX(\*) send invalidations to L1
- Everything else can be handled in L2
- If L1 is write through, L2 could "remember" state of L1 cache line
  - May increase traffic though
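The L1/L2 interaction above can be sketched for one core with a write-back L1 and an inclusive L2 that is connected to the bus. The class `CoreCaches`, its fields, and the action strings are illustrative assumptions, and the sketch models a single cache line with only the cases named on the slide.

```python
class CoreCaches:
    """One core's private L1 and L2 for a single cache line (only L2 sees the bus)."""

    def __init__(self, state="I"):
        self.l1 = state
        self.l2 = state

    def local_write(self):
        """Write hit in a write-back L1: the line becomes M in L1, L2 is not told."""
        if self.l1 not in ("M", "E"):
            raise ValueError("sketch only models write hits on M or E lines")
        self.l1 = "M"

    def snoop(self, bus_op):
        """Handle a snooped bus transaction at L2, consulting L1 where required."""
        actions = []
        if self.l2 == "I":
            return actions
        if self.l1 == "M" and self.l2 != "M":
            actions.append("L1 -> L2 write-back")   # L2's view was stale
            self.l2 = "M"
        if self.l2 == "M":
            actions.append("L2 -> memory write-back")
        if bus_op == "BusRd":
            actions.append("signal shared")
            self.l2 = "S"
            if self.l1 != "I":
                self.l1 = "S"
        else:                                        # BusRdX or BusRdX*
            if self.l1 != "I":
                actions.append("invalidate L1")
            self.l1 = self.l2 = "I"
        return actions

core = CoreCaches("E")
core.local_write()
print(core.l1, core.l2)          # L1 and L2 now disagree about the line
print(core.snoop("BusRd"))
```

The disagreement after `local_write` is the case the slide warns about: L2 alone would treat the line as clean, so the snoop has to reach L1 before the request can be answered correctly.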

### Directory-based cache coherence

#### Snooping does not scale

- Bus transactions must be globally visible
- Implies broadcast
- Typical solution: tree-based (hierarchical) snooping
  - Root becomes a bottleneck
- Directory-based schemes are more scalable
  - Directory (entry for each cache line, CL) keeps track of all owning caches
  - Point-to-point update to involved processors
    - No broadcast
    - Can use specialized (high-bandwidth) network, e.g., HT, QPI...



### **Directory-based CC: Write miss**

- P<sub>i</sub> intends to write, misses
- If dirty bit (in directory) is off
  - Send invalidations to all processors P<sub>j</sub> with presence[j] turned on
  - Unset presence bit for all processors
  - Set dirty bit
  - Set presence[i], owner P<sub>i</sub>
- If dirty bit is on
  - Recall cache line from owner P<sub>j</sub>
  - Update memory
  - Unset presence[j]
  - Set presence[i], dirty bit remains set
  - Supply data to writer
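The write-miss handling above can be sketched with a directory entry that holds one presence bit per processor plus a dirty bit. The names `DirectoryEntry` and `write_miss` and the message tuples are illustrative assumptions. The slide lists "supply data to writer" only for the dirty case; the sketch also emits it in the clean case, where memory would supply the line.

```python
class DirectoryEntry:
    """Directory state for one cache line: presence bit per processor and a dirty bit."""

    def __init__(self, n_procs):
        self.presence = [False] * n_procs
        self.dirty = False

def write_miss(entry, i):
    """Processor i misses on a write; return the point-to-point messages that result."""
    msgs = []
    if not entry.dirty:
        # Clean line: invalidate every sharer, then hand ownership to i.
        for j, present in enumerate(entry.presence):
            if present and j != i:
                msgs.append(("invalidate", j))
        entry.presence = [False] * len(entry.presence)
        entry.dirty = True
    else:
        # Dirty line: exactly one owner j holds the only valid copy.
        j = entry.presence.index(True)
        msgs.append(("recall", j))
        msgs.append(("update_memory", j))
        entry.presence[j] = False
    entry.presence[i] = True             # i is the new owner, dirty bit stays set
    msgs.append(("supply_data", i))
    return msgs

entry = DirectoryEntry(4)
entry.presence[1] = entry.presence[2] = True   # line is shared by P1 and P2
print(write_miss(entry, 0))
print(write_miss(entry, 3))
```

Messages go only to processors whose presence bit is set, which is the point of the directory: no broadcast is needed, at the cost of keeping the per-line bookkeeping.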

## Discussion

- Scaling of memory bandwidth
  - No centralized memory
- Directory-based approaches scale with restrictions
  - Require presence bit for each cache
  - Number of bits determined at design time
  - Directory requires memory (size scales linearly)
  - Shared vs. distributed directory
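The linear growth of the directory can be made concrete with a short calculation. The sketch assumes a full bit-vector directory (one presence bit per cache plus one dirty bit per entry) and 64 B cache lines; both are assumptions for illustration, and real designs often use more compact encodings.

```python
def directory_overhead(n_caches, line_bytes=64):
    """Directory bits per cache line, as a fraction of the line's data bits."""
    bits_per_entry = n_caches + 1        # presence bits plus one dirty bit
    return bits_per_entry / (line_bytes * 8)

for n in (8, 64, 512):
    print(f"{n:4d} caches: {directory_overhead(n) * 100:6.1f}% of data size")
```

Under these assumptions the directory costs about 12.7% of the data size for 64 caches and exceeds the data itself at 512 caches, which is why the number of presence bits is fixed at design time and why such schemes scale only with restrictions.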

#### Software emulation

- Distributed shared memory (DSM)
- Emulate cache coherence in software (e.g., TreadMarks)
- Often on a per-page basis, utilizes memory virtualization and paging

### **Open Problems (for projects or theses)**

#### Tune algorithms to cache-coherence schemes

- What is the optimal parallel algorithm for a given scheme?
- Parameterize for an architecture

#### Measure and classify hardware

- Read Maranget et al. "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models" and have fun!
- RDMA consistency is barely understood!
- GPU memories are not well understood! Huge potential for new insights!

#### Can we program (easily) without cache coherence?

- How to fix the problems with inconsistent values?
- Compiler support (issues with arrays)?



