



TH, R. Belli: Scientific Benchmarking of Parallel Computing Systems, IEEE/ACM SC15 (full talk at https://www.youtube.com/watch



Scientific Benchmarking: Pitfalls of Relative Performance Reporting (Rule 1)



spcl.inf.ethz.ch, @spcl_eth, ETH Zürich

The only difference is the baseline.
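To see why the baseline matters, the short sketch below computes the speedup of one parallel run against two different baselines. All timings are hypothetical values invented for illustration; they are not measurements from the talk, and the helper name `speedup` is likewise an assumption.

```python
# Hypothetical timings in seconds (illustrative only, not from the source).
t_best_serial = 10.0      # best available serial implementation
t_parallel_1proc = 16.0   # the parallel code run on a single process
t_parallel_64proc = 0.5   # the parallel code run on 64 processes


def speedup(t_base: float, t_new: float) -> float:
    """Relative speedup of a run taking t_new over a baseline taking t_base."""
    return t_base / t_new


vs_serial = speedup(t_best_serial, t_parallel_64proc)
vs_single = speedup(t_parallel_1proc, t_parallel_64proc)

print(f"speedup vs. best serial:        {vs_serial:.1f}x")
print(f"speedup vs. 1-process parallel: {vs_single:.1f}x")
```

The same 0.5 s run appears as a 20x or a 32x speedup depending only on which base case is chosen, which is why the base case and its absolute time should be stated alongside any ratio.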

## SPCL

- Goals of this lecture
- Memory Trends
- Short Refresher on Locality and Caches!
- Cache Coherence in Multiprocessors
- Advanced Memory Consistency












- Supports producer-consumer pattern well
- Many (local) writes may waste bandwidth!

Hybrid forms are possible!

- Memory is up to date

Shared (S)

- Unmodified copies may exist in other caches
- Memory is up to date

Invalid (I)

- Line is not in cache

Terminology

- Clean line:
  - Content of cache line and main memory is identical (also: memory is up to date)
  - Can be evicted without write-back
- Dirty line:
  - Content of cache line and main memory differ (also: memory is stale)
  - Needs to be written back eventually
    - Time depends on protocol details
- Bus transaction:
  - A signal on the bus that can be observed by all caches
  - Usually blocking
- Local read/write:
  - A load/store operation originating at a core connected to the cache

Transitions in response to local reads

- State is M
  - No bus transaction
- State is E
  - No bus transaction
- State is S
  - No bus transaction
- State is I
  - Generate bus read request (BusRd)
    - May force other cache operations (see later)
  - Other cache(s) signal "sharing" if they hold a copy
  - If shared was signaled, go to state S
  - Otherwise, go to state E
- After update: return read value

Transitions in response to local writes

- State is M
  - No bus transaction
- State is E
  - No bus transaction
  - Go to state M
- State is S
  - Line already local & clean
  - There may be other copies
  - Generate bus read request for upgrade to exclusive (BusRdX*)
  - Go to state M
- State is I
  - Generate bus read request for exclusive ownership (BusRdX)
  - Go to state M

Transitions in response to snooped BusRd

- State is M
  - Write cache line back to main memory
  - Signal "shared"
  - Go to state S (or E)
- State is E
  - Go to state S and signal "shared"
- State is S
  - Signal "shared"
- State is I
  - Ignore
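The processor-side rules above (local reads and local writes) can be sketched as two small functions. This is a minimal illustration, not code from the lecture: the names `State`, `on_local_read`, `on_local_write` and the parameter `shared_signaled` are assumptions introduced here, and the bus is reduced to a returned string.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def on_local_read(state, shared_signaled=False):
    """Return (next_state, bus_action) for a load issued by the local core."""
    if state is State.I:
        # Miss: issue BusRd. Other caches signal "shared" if they hold a copy.
        return (State.S if shared_signaled else State.E), "BusRd"
    # M, E and S are hits: no bus transaction, state unchanged.
    return state, None


def on_local_write(state):
    """Return (next_state, bus_action) for a store issued by the local core."""
    if state in (State.M, State.E):
        return State.M, None        # no bus transaction needed
    if state is State.S:
        return State.M, "BusRdX*"   # upgrade: other copies may exist
    return State.M, "BusRdX"        # state I: request exclusive ownership


print(on_local_read(State.I, shared_signaled=True))
print(on_local_write(State.S))
```

Note that only a read miss depends on what the other caches report; every other processor-side transition is decided by the local state alone.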


Transitions in response to snooped BusRdX


- State is M
  - Write cache line back to memory
  - Discard line and go to I
- State is E
  - Discard line and go to I
- State is S
  - Discard line and go to I
- State is I
  - Ignore
- BusRdX* is handled like BusRdX!

MESI State Diagram (FSM)

[Figure: MESI state diagram. Edge labels recoverable from the slide: PrRd/-, PrWr/-, PrWr/BusRdX, PrRd/BusRd(S), BusRd/-, BusRd/Flush, BusRdX/Flush.]
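The snoop-side half of the state machine (reactions to BusRd and BusRdX observed on the bus) is small enough to write as a lookup table. The sketch below follows the textual transition rules rather than the diagram; the names `SNOOP` and `snoop` are hypothetical, and for a snooped BusRd in state M it assumes the line goes to S, although the rules also allow E.

```python
# (current state, snooped transaction) -> (next state, action by this cache)
SNOOP = {
    ("M", "BusRd"):  ("S", "write back, signal shared"),  # assumption: S, not E
    ("E", "BusRd"):  ("S", "signal shared"),
    ("S", "BusRd"):  ("S", "signal shared"),
    ("I", "BusRd"):  ("I", None),                         # ignore
    ("M", "BusRdX"): ("I", "write back"),                 # then discard line
    ("E", "BusRdX"): ("I", None),                         # discard line
    ("S", "BusRdX"): ("I", None),                         # discard line
    ("I", "BusRdX"): ("I", None),                         # ignore
}


def snoop(state: str, transaction: str):
    """Return (next_state, action) for a transaction observed on the bus."""
    if transaction == "BusRdX*":
        transaction = "BusRdX"  # the upgrade request is handled like BusRdX
    return SNOOP[(state, transaction)]


print(snoop("M", "BusRd"))
print(snoop("S", "BusRdX*"))
```

Keeping the transitions in a table makes it easy to compare them entry by entry against the edges of the state diagram.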

Small Exercise

Initially: all in I state

| Action      | P1 state | P2 state | P3 state | Bus action | Data from |
|-------------|----------|----------|----------|------------|-----------|
| P1 reads x  | E        | I        | I        | BusRd      | Memory    |
| P2 reads x  | S        | S        | I        | BusRd      | Cache     |
| P1 writes x | M        | I        | I        | BusRdX*    | Cache     |
| P1 reads x  | M        | I        | I        | -          | Cache     |
| P3 writes x | I        | I        | M        | BusRdX     | Memory    |

Optimizations?

Class question: what could be optimized in the MESI protocol to make a system faster?

Related Protocols: MOESI (AMD)

- Extended MESI protocol
- Cache-to-cache transfer of modified cache lines
- Cache in M or O state always transfers cache line to requesting cache
- No need to contact (slow) main memory
- Avoids write back when another process accesses cache line
- Good when cache-to-cache performance is higher than cache-to-memory, e.g., shared last level cache!

MOESI states:

- Modified (M): Modified Exclusive
  - No copies in other caches, local copy dirty
  - Memory is stale, cache supplies copy (reply to BusRd*)
- Owned (O): Modified Shared
  - Exclusive right to make changes
  - Other S copies may exist ("dirty sharing")
  - Memory is stale, cache supplies copy (reply to BusRd*)
- Exclusive (E):
  - Same as MESI (one local copy, up to date memory)
- Shared (S):
  - Unmodified copy may exist in other caches
  - Memory is up to date unless an O copy exists in another cache
- Invalid (I):
  - Same as MESI

MOESI State Diagram

[Figure: MOESI state diagram; not recoverable from the extraction.]
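The small MESI exercise above (three caches, one line x, all initially in I) can be replayed with a few lines of Python. This is a simplified sketch under stated assumptions: the function name `step` and the list-of-strings representation are made up for illustration, a snooped BusRd always moves a valid copy to S, and the "Data from" column is not modeled.

```python
def step(states, proc, op):
    """Apply one read/write by processor index `proc`; return the bus action.

    `states` holds the per-cache MESI state of a single line x and is
    updated in place.
    """
    mine = states[proc]
    others = [i for i in range(len(states)) if i != proc]

    if op == "read":
        if mine != "I":
            return "-"                      # hit in M, E or S
        shared = any(states[i] != "I" for i in others)
        for i in others:                    # snooped BusRd: M/E/S -> S
            if states[i] != "I":
                states[i] = "S"
        states[proc] = "S" if shared else "E"
        return "BusRd"

    # op == "write"
    if mine in ("M", "E"):
        states[proc] = "M"                  # no bus transaction
        return "-"
    bus = "BusRdX*" if mine == "S" else "BusRdX"
    for i in others:                        # snooped BusRdX: discard line
        states[i] = "I"
    states[proc] = "M"
    return bus


states = ["I", "I", "I"]                    # P1, P2, P3
trace = []
for proc, op in [(0, "read"), (1, "read"), (0, "write"), (0, "read"), (2, "write")]:
    bus = step(states, proc, op)
    trace.append((f"P{proc + 1} {op}s x", *states, bus))
    print(trace[-1])
```

Each printed tuple lists the action, the three cache states, and the bus action, in the same order as the columns of the exercise table.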

| Related Protocols: MESIF (Intel) | | Multi-level cache | | | |
|---|---|---|---|---|---|
| <ul> <li>Modified (M): Modified Exclusive <ul> <li>No copies in other caches, local copy dirty</li> <li>Memory is stale, cache supplies copy (reply to BusRd*)</li> </ul> </li> <li>Exclusive (E): <ul> <li>Same as MESI (one local copy, up to date memory)</li> </ul> </li> <li>Shared (S): <ul> <li>Unmodified copy may exist in other caches</li> <li>Memory is up to date</li> </ul> </li> <li>Invalid (I): <ul> <li>Same as MESI</li> </ul> </li> <li>Forward (F): <ul> <li>Special form of S state, other caches may have line in S</li> <li>Most recent requester of line is in F state</li> <li>Cache acts as responder for requests to this line</li> </ul> </li> </ul> | | <ul> <li>… level cache" is connected to bus or …</li> <li>Yet, snoop requests are relevant for inner-levels of cache (L1)</li> <li>Modifications of L1 data may not be visible at L2 (and thus the bus)</li> <li>L1/L2 modifications <ul> <li>On BusRd check if line is in M state in L1 <ul> <li>It may be in E or S in L2!</li> </ul> </li> <li>On BusRdX(*) send invalidations to L1</li> <li>Everything else can be handled in L2</li> </ul> </li> <li>… L2 could "remember" state of …</li> </ul> | | | |
| Directory-based cache coherence | | Basic Scheme | | | |
| Snooping does not scale <ul> <li>Bus transactions must be globally visible</li> </ul> | | System with N processors P<sub>i</sub> | | P<sub>0</sub> P<sub>1</sub> | P<sub>2</sub> |
| Implies broadcast | | For each memory block | | Cache Cache | Cache |
| Typical solution: tree-based (hierarchical) snooping <ul> <li>Root becomes a bottleneck</li> </ul> | | maintain a directory | | X = 0 | |
| Directory-based schemes are more scalable | | N presence bits <ul> <li>Set if block in</li> </ul> | | | |
| <ul> <li>Directory (entry for each CL) keeps track of all owning caches</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                             | <ul> <li>1 dirty bit (red)</li> </ul>                                                                                                                                                                                      |                                                                                                                                                                                                                                   |                               |                                    |
| Point-to-point update to involved processors                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                                             |                                                                                                                                                                                                                            |                                                                                                                                                                                                                                   |                               |                                    |
| No broadcast<br>Can use specialized (high-bandwidth) network, e.g., HT, QPI                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                             | First proposed by Ce                                                                                                                                                                                                       | nsier and Feautrier (1978)                                                                                                                                                                                                        | Directory                     | X = 7                              |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | e1                                          |                                                                                                                                                                                                                            |                                                                                                                                                                                                                                   |                               |                                    |
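
The directory entry described above (N presence bits plus 1 dirty bit per memory block) can be sketched as a small Python model. This is a minimal illustration, not the exact protocol from the lecture: the names `DirectoryEntry`, `Directory`, `read`, `write`, and the `messages` counter are hypothetical, and the message accounting is a simplifying assumption (one point-to-point message per invalidated sharer or per write-back request).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DirectoryEntry:
    """Directory state for one memory block (one cache line)."""
    presence: List[bool]   # presence[i] is set if the block is in P_i's cache
    dirty: bool = False    # set if one cache holds a modified copy


class Directory:
    """Toy full-bit-vector directory for n processors (hypothetical model)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.entries: Dict[int, DirectoryEntry] = {}
        self.messages = 0  # point-to-point messages sent by the directory

    def _entry(self, block: int) -> DirectoryEntry:
        # Create the entry lazily with all presence bits cleared.
        if block not in self.entries:
            self.entries[block] = DirectoryEntry([False] * self.n)
        return self.entries[block]

    def read(self, p: int, block: int) -> None:
        e = self._entry(block)
        if e.dirty and not e.presence[p]:
            # Assumption: ask the single owner to write back; it keeps a copy.
            self.messages += 1
            e.dirty = False
        e.presence[p] = True

    def write(self, p: int, block: int) -> None:
        e = self._entry(block)
        # Invalidate only the caches whose presence bit is set (no broadcast).
        for i in range(self.n):
            if i != p and e.presence[i]:
                self.messages += 1
                e.presence[i] = False
        e.presence[p] = True
        e.dirty = True


d = Directory(4)
d.read(0, 7)    # P0 caches block 7
d.read(1, 7)    # P1 caches block 7
d.write(2, 7)   # P2 writes: only P0 and P1 are invalidated
```

In this sketch the write by P2 costs two invalidation messages, one per actual sharer, whereas a snooping bus would make the transaction visible to all other caches regardless of whether they hold the block.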









## Slide 23, RAM: © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons) https://commons.wikimedia.org/wiki/File:Apacer_SDRAM-3386.jpg