**Design of Parallel and High-Performance Computing**

Fall 2015

Lecture: Languages and Locks

Motivational video: <u>https://www.youtube.com/watch?v=1o4YViBAGU0</u>

Instructor: Torsten Hoefler & Markus Püschel

TAs: Timo Schneider

ETH, Eidgenössische Technische Hochschule Zürich, Swiss Federal Institute of Technology Zurich

## Administrivia

- You should have a project partner by now
  - And a topic!
- Progress presentations: Monday 11/2 (two weeks from today!)
  - ... may continue the following Thursday in recitation (order will be randomized)
  - Send slides (ppt or pdf) by Sunday 11/1 11:59pm to Timo!
  - 10 minutes per team (hard limit)
  - Prepare! This is your first impression, gather feedback from us!
- Rough guidelines:
  - Present your plan
  - Related work (what exists, careful literature review!)
  - Preliminary results (what are your detailed plans, milestones)
  - Main goal is to gather feedback, so present some details
  - Ideally one presenter (make sure to switch for other presentations!)
- Final project presentation: Monday 12/14 during last lecture

### Review of last lecture

- Locked Queue
  - Correctness
- Lock-free two-thread queue
- Linearizability
  - Combine object pre- and postconditions with serializability
  - Additional (semantic) constraints!
- Histories
  - Analyze given histories
  - Projections, Sequential/Concurrent, Completeness, Equivalence, Well formed, Linearizability (formal)

### Peer Quiz

- Instructions:
  - Pick some partners (locally) and discuss each question for 2 minutes
  - We then select a random student (team) to answer the question
- How can histories be used to prove a parallel code correct?
  - How do histories relate to the source code?
  - Can proving be automated?
- What are the practical limits of linearizability?
  - Can it always be applied?
  - Is there a performance tradeoff? Always? Sometimes? Never?

### Goals of this lecture

- Languages and Memory Models
  - Java/C++ definition
- **Recap serial consistency**
  - Races (now in practice)
- Mutual exclusion
- Locks
  - Two-thread
  - Peterson
  - N-thread
- Many different locks, strengths and weaknesses
- Lock options and parameters
- Problems and outline to next class

**DPHPC Overview** (figure: course topic map). Topics shown: parallelism and locality as concepts & techniques; vector ISA, shared memory, distributed memory; caches, memory hierarchy, cache coherency; memory models; distributed algorithms, group communications; locks, lock free, wait free, linearizability; Amdahl's and Gustafson's law; models: PRAM, LogP, α-β, I/O complexity; balance principles I and II, Little's Law, scheduling.



All memory accesses before an unlock are ordered before and are visible to any memory access after a matching lock!

- mutual exclusion
  - at most one thread may own a lock
  - a thread B trying to acquire a lock held by thread A blocks until thread A releases the lock
  - note: threads may wait forever (no progress guarantee!)
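The lock semantics above can be tried out with a small sketch. It assumes `std::mutex` from C++11 as the lock; the helper name `locked_sum` is made up for illustration. Because every write to the plain counter happens before an unlock and every read happens after a matching lock, no update should be lost.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: p threads each add 1 to a plain (non-atomic) counter
// n times, with every access bracketed by lock()/unlock() on one std::mutex.
long locked_sum(int p, int n) {
    assert(p > 0 && n >= 0);
    long counter = 0;              // plain variable, protected only by the mutex
    std::mutex m;
    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&counter, &m, n] {
            for (int i = 0; i < n; ++i) {
                m.lock();          // sees all writes made before the matching unlock
                ++counter;         // critical section
                m.unlock();        // makes the write visible to the next owner
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Calling `locked_sum(4, 10000)` should return 40000 regardless of how the threads are scheduled.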

# Synchronization Variables

- Variables can be declared volatile (Java) or atomic (C++)
- Reads and writes to synchronization variables
  - Are totally ordered with respect to all threads
  - Must not be reordered with normal reads and writes

#### Compiler

- Must not allocate synchronization variables in registers
- Must not swap variables with synchronization variables
- May need to issue memory fences/barriers
- ...

### Synchronization Variables

- Write to a synchronization variable
  - Similar memory semantics as unlock (no process synchronization!)
- Read from a synchronization variable
  - Similar memory semantics as lock (no process synchronization!)
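A minimal sketch of these semantics, assuming a C++11 `std::atomic<bool>` as the synchronization variable (the helper name `publish_and_consume` is hypothetical): the producer's atomic write behaves like an unlock, the consumer's atomic read behaves like a lock, and the normal write to `payload` cannot be reordered past the flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns the payload as observed by the consumer thread.
int publish_and_consume(int value) {
    int payload = 0;                   // normal (non-atomic) data
    std::atomic<bool> ready{false};    // synchronization variable
    int seen = -1;

    std::thread producer([&] {
        payload = value;               // normal write ...
        ready.store(true);             // ... must not be reordered after this write
    });
    std::thread consumer([&] {
        while (!ready.load()) {        // atomic read, similar memory semantics as lock
            std::this_thread::yield();
        }
        seen = payload;                // ordered after the producer's write to payload
    });
    producer.join();
    consumer.join();
    assert(seen == value);             // the consumer observed the published value
    return seen;
}
```

Note that the atomic flag only orders memory accesses; the waiting itself is done by the explicit spin loop, which matches the "no process synchronization" remark.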



### **Memory Model Rules**

- Java/C++: Correctly synchronized programs will execute sequentially consistently
- Correctly synchronized = data-race free
  - iff all sequentially consistent executions are free of data races
- Two accesses to a shared memory location form a data race in the execution of a program if
  - The two accesses are from different threads
  - At least one access is a write and
  - The accesses are not synchronized



Manson, Pugh, Adve: "The Java Memory Model", POPL'05

#### **Preliminary Comments**

- All code examples are in C/C++ style
  - Neither C nor C++ before C++11 has a clear memory model
  - C++ is one of the languages of choice in HPC
  - Consider source as exemplary (and pay attention to the memory model)!
    - In fact, many/most of the examples are incorrect in anything but sequential consistency!
    - In fact, you'll most likely not need those algorithms, but the principles will be useful!
- x86 is really only used because it's common
  - This does not mean that we consider the ISA or memory model elegant!
- We assume atomic memory (or registers)!
  - Usually given on x86 (easy to enforce)
- Number of threads/processes is p, tid is the thread id

# Case Study: Locks - Lecture Goals

- Among the simplest concurrency constructs
  - Yet, complex enough to illustrate many optimization principles
- Goal 1: You understand locks in detail
  - Requirements / guarantees
  - Correctness / validation
  - Performance / scalability
- Goal 2: Acquire the ability to design your own locks
  - Understand techniques and weaknesses/traps
  - Extend to other concurrent algorithms
  - Issues are very much the same
- Goal 3: Feel the complexity of shared memory!

#### **Recap Concurrent Updates**

const int n=1000;
volatile int a=0;
for (int i=0; i<n; ++i)
 a++;

Compiled with gcc -O3:

 movl \$1000, %eax // i=n=1000
.L2:
 movl (%rdx), %ecx // ecx = \*a
 addl \$1, %ecx // ecx++
 subl \$1, %eax // i--
 movl %ecx, (%rdx) // \*a = ecx
 jne .L2 // loop if i>0

- Multi-threaded execution!
  - Value of a for p=1?
  - Value of a for p>1?
  - Why? Isn't it a single instruction?

const int n=1000;
std::atomic<int> a;
a=0;
for (int i=0; i<n; ++i)
 a++;

Compiled with g++ -O3:

 movl \$1000, %eax // i=n=1000
 movl \$0, -24(%rsp) // a = 0
 mfence // a is visible!
.L2:
 lock addl \$1, -24(%rsp) // (\*a)++
 subl \$1, %eax // i--
 jne .L2 // loop if i>0
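To make the `std::atomic<int>` variant concrete, here is a small sketch with p threads sharing one counter. The helper name `atomic_count` is made up for illustration. Only the atomic version is shown because the `volatile int` version is a data race in C++: its separate load, add, and store can interleave between threads, so updates may be lost for p>1.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: p threads each perform n atomic increments on one counter.
int atomic_count(int p, int n) {
    assert(p > 0 && n >= 0);
    std::atomic<int> a{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&a, n] {
            for (int i = 0; i < n; ++i) {
                a++;               // atomic read-modify-write, no increment is lost
            }
        });
    }
    for (auto& th : threads) th.join();
    return a.load();
}
```

With this version the final value is p·n for any p, for example `atomic_count(4, 1000)` gives 4000.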






#### Simple Two-Thread Locks

Another two-thread spin-lock: LockOne

volatile int flag[2];

void lock() {
 int j = 1 - tid;
 flag[tid] = true;
 while (flag[j]) {} // wait
}

void unlock() {
 flag[tid] = false;
}

- When and why does this guarantee mutual exclusion?
- Does it work in practice?

#### **Correctness Proof**

- In sequential consistency!
- Intuitions:
  - Situation: both threads are ready to enter
  - Show that a situation that allows both to enter leads to a schedule violating sequential consistency
    - Using transitivity of program and synchronization orders

#### Simple Two-Thread Locks

A third attempt at two-thread locking: LockTwo

volatile int victim;

void lock() {
 victim = tid; // grant access
 while (victim == tid) {} // wait
}

void unlock() {}

- Does this guarantee mutual exclusion?
- Does it work in practice?

#### **Correctness Proof**

- Intuition:
  - Victim is only written once per lock()
  - A can only enter after B wrote
  - B cannot enter in any sequentially consistent schedule

#### Simple Two-Thread Locks

- The last two locks provide mutual exclusion
  - LockOne succeeds iff lock attempts do not overlap
  - LockTwo succeeds iff lock attempts do overlap
- Combine both into one locking strategy!
  - Peterson's lock (1981)
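The two "succeeds iff" statements can be checked with a single-threaded model that replays one fixed interleaving per call. This is only an illustration under that assumption, not a lock implementation; the function names `lock_one_enters` and `lock_two_enters` are hypothetical.

```cpp
#include <cassert>

// LockOne: a thread may enter once the other thread's flag is clear.
// overlap == true means the other thread set its flag before we test it.
bool lock_one_enters(bool overlap, int tid) {
    assert(tid == 0 || tid == 1);
    bool flag[2] = {false, false};
    flag[tid] = true;                    // this thread announces interest
    if (overlap) flag[1 - tid] = true;   // the other thread announces before we test
    return !flag[1 - tid];               // exit condition of while (flag[j]) {}
}

// LockTwo: a thread may enter once it is no longer the victim.
// overlap == true means the other thread wrote victim after we did.
bool lock_two_enters(bool overlap, int tid) {
    assert(tid == 0 || tid == 1);
    int victim = tid;                    // this thread writes victim first
    if (overlap) victim = 1 - tid;       // the other thread overwrites it afterwards
    return victim != tid;                // exit condition of while (victim == tid) {}
}
```

In this model LockOne lets a lone thread in but leaves both threads spinning when their attempts overlap, while LockTwo leaves a lone thread spinning and lets the first writer in once a second thread arrives. Peterson's lock combines the two exit conditions.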

### Peterson's Two-Thread Lock (1981)

Combines the first lock (request access) with the second lock (grant access)

volatile int flag[2];
volatile int victim;

void lock() {
 int j = 1 - tid;
 flag[tid] = 1; // I'm interested
 victim = tid; // other goes first
 while (flag[j] && victim == tid) {}; // wait
}

void unlock() {
 flag[tid] = 0; // I'm not interested
}
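A runnable sketch of Peterson's lock, assuming C++11 `std::atomic<int>` with its default sequentially consistent ordering in place of `volatile` (the slide code is only correct under sequential consistency). The class name `PetersonLock`, the driver `peterson_count`, and the `yield()` call in the spin loop are additions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Peterson's two-thread lock on sequentially consistent atomics
// (an assumption of this sketch; the slide code uses volatile).
class PetersonLock {
    std::atomic<int> flag[2];
    std::atomic<int> victim;
public:
    PetersonLock() : victim(0) { flag[0] = 0; flag[1] = 0; }
    void lock(int tid) {
        assert(tid == 0 || tid == 1);
        int j = 1 - tid;
        flag[tid] = 1;                        // I'm interested
        victim = tid;                         // other goes first
        while (flag[j] && victim == tid) {    // wait
            std::this_thread::yield();        // not in the slides, helps when oversubscribed
        }
    }
    void unlock(int tid) { flag[tid] = 0; }   // I'm not interested
};

// Hypothetical driver: two threads increment a plain counter n times each.
long peterson_count(int n) {
    PetersonLock lck;
    long counter = 0;
    auto work = [&](int tid) {
        for (int i = 0; i < n; ++i) {
            lck.lock(tid);
            ++counter;                        // critical section
            lck.unlock(tid);
        }
    };
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    return counter;
}
```

With sequentially consistent atomics the counter should end at exactly 2n, for example 200000 for n=100000.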

### **Starvation Freedom**

- (recap) definition: Every thread that calls lock() eventually gets the lock.
  - Implies deadlock-freedom!
- Is Peterson's lock starvation-free?


### **Proof Correctness**

#### Intuition:

- Victim is written once
- Pick thread that wrote victim last
- Show thread must have read flag==0
- Show that no sequentially consistent schedule permits that

### **Proof Starvation Freedom**

#### Intuition:

- Threads can only wait/starve in while()
  - Until flag==0 or victim==other
- Other thread enters lock() → sets victim to other
  - Will definitely "unstuck" first thread
- So other thread can only be stuck in lock()
  - Will wait for victim==other, victim cannot block both threads and one must leave!

### Peterson in Practice ... on x86

- Implement and run our little counter on x86
- 100000 iterations
  - 1.6 · 10<sup>-6</sup>% errors
  - What is the problem?
    - No sequential consistency for W(y) and R(flag[j])

With a memory fence before the spin loop:

volatile int flag[2];
volatile int victim;

void lock() {
 int j = 1 - tid;
 flag[tid] = 1; // I'm interested
 victim = tid; // other goes first
 asm ("mfence");
 while (flag[j] && victim == tid) {}; // wait
}

void unlock() {
 flag[tid] = 0; // I'm not interested
}



#### **Filter Lock**

volatile int level[n] = {0,0,...,0}; // indicates highest level a thread tries to enter
volatile int victim[n]; // the victim thread, excluded from next level

void lock() {
 for (int i = 1; i < n; i++) { // attempt level i
  level[tid] = i;
  victim[i] = tid;
  // spin while conflicts exist
  while ((∃k != tid) (level[k] >= i && victim[i] == tid )) {};
 }
}

void unlock() {
 level[tid] = 0;
}

- What are the disadvantages of this lock?

#### **Filter Lock Starvation Freedom**

- Intuition:
  - Inductive argument over j (levels)
  - Base-case: level n-1 has one thread (not stuck)
  - Level i: assume thread is stuck
    - Eventually, higher levels will drain (induction)
    - Last entering thread is victim, it will wait
    - Thus, only one thread can be stuck at each level
    - Victim can only have one value → older threads will advance!

#### **Lock Fairness**

- Starvation freedom provides no guarantee on how long a thread waits or if it is "passed"!
- To reason about fairness, we define two sections of each lock algorithm:
  - Doorway D (bounded # of steps)
  - Waiting W (unbounded # of steps)

void lock() {
 int j = 1 - tid;
 flag[tid] = true; // I'm interested
 victim = tid; // other goes first
 while (flag[j] && victim == tid) {};
}

- FIFO locks:
  - If T<sub>A</sub> finishes its doorway before T<sub>B</sub> then CR<sub>A</sub> → CR<sub>B</sub>
  - Implies fairness

#### **Lamport's Bakery Algorithm (1974)**

- Is a FIFO lock (and thus fair)
- Each thread takes number in doorway and threads enter in the order of their number!

volatile int flag[n] = {0,0,...,0};
volatile int label[n] = {0,0,...,0};

void lock() {
 flag[tid] = 1; // request
 label[tid] = max(label[0], ...,label[n-1]) + 1; // take ticket
 while ((∃k != tid)(flag[k] && (label[k],k) <\* (label[tid],tid))) {};
}

void unlock() {
 flag[tid] = 0;
}

- Advantages:
  - Elegant and correct solution
  - Starvation free, even FIFO fairness
- Not used in practice!
  - Why?
  - Needs to read/write N memory locations for synchronizing N threads
  - Can we do better?
    - Using only atomic registers/memory

#### **A Lower Bound to Memory Complexity**

- Theorem 5.1 in [1]: "If S is a [atomic] read/write system with at least two processes and S solves mutual exclusion with global progress [deadlock-freedom], then S must have at least as many variables as processes"
- So we're doomed! Optimal locks are available and they're fundamentally non-scalable. Or not?

[1] J. E. Burns and N. A. Lynch. Bounds on shared memory for mutual exclusion. Information and Computation, 107(2):171-184, December 1993
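The bakery algorithm can be sketched in C++ as follows, assuming sequentially consistent `std::atomic` variables instead of `volatile`. The names `BakeryLock` and `bakery_count` are hypothetical. The existential test over all k is unrolled into one loop per thread, and the `yield()` in the spin loop is an addition. Labels grow without bound, so a long-running program would have to handle overflow.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class BakeryLock {
    std::vector<std::atomic<int>> flag;
    std::vector<std::atomic<long>> label;
    // Lexicographic order on (label, thread id).
    bool earlier(int k, int tid) {
        long lk = label[k], lt = label[tid];
        return lk < lt || (lk == lt && k < tid);
    }
public:
    explicit BakeryLock(int n) : flag(n), label(n) {
        assert(n > 0);
        for (int k = 0; k < n; ++k) { flag[k] = 0; label[k] = 0; }
    }
    void lock(int tid) {
        const int n = static_cast<int>(flag.size());
        flag[tid] = 1;                                    // request (doorway begins)
        long max_label = 0;
        for (int k = 0; k < n; ++k) max_label = std::max(max_label, label[k].load());
        label[tid] = max_label + 1;                       // take ticket (doorway ends)
        for (int k = 0; k < n; ++k) {                     // wait for every earlier ticket
            if (k == tid) continue;
            while (flag[k] && earlier(k, tid)) std::this_thread::yield();
        }
    }
    void unlock(int tid) { flag[tid] = 0; }
};

// Hypothetical driver: p threads increment a plain counter n times each.
long bakery_count(int p, int n) {
    BakeryLock lck(p);
    long counter = 0;
    std::vector<std::thread> threads;
    for (int tid = 0; tid < p; ++tid) {
        threads.emplace_back([&lck, &counter, n, tid] {
            for (int i = 0; i < n; ++i) {
                lck.lock(tid);
                ++counter;                                // critical section
                lck.unlock(tid);
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

The sketch also makes the memory cost visible: each `lock()` reads all n labels and then all n flags, which is the N-location overhead the slides criticize.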

#### **Hardware Support?**

- Hardware atomic operations:
  - Test&Set
    - Write const to memory while returning the old value
  - Atomic swap
    - Atomically exchange memory and register
  - Fetch&Op
    - Get value and apply operation to memory location
  - Compare&Swap
    - Compare two values and swap memory with register if equal
  - Load-linked/Store-Conditional LL/SC
    - Loads value from memory, allows operations, commits only if no other updates committed → mini-TM
  - Intel TSX (transactional synchronization extensions)
    - Hardware-TM (roll your own atomic operations)

#### **Relative Power of Synchronization**

- Design-Problem I: Multi-core Processor
  - Which atomic operations are useful?
- Design-Problem II: Complex Application
  - What atomic should I use?
- Concept of "consensus number" C if a primitive can be used to solve the "consensus problem" in a finite number of steps (even if threads stop)
  - atomic registers have C=1 (thus locks have C=1!)
  - TAS, Swap, Fetch&Op have C=2
  - CAS, LL/SC, TM have C=∞

#### **Test-and-Set Locks**

- Test-and-Set semantics
  - Memoize old value
  - Set fixed value TASval (true)
  - Return old value
- After execution:
  - Post-condition is a fixed (constant) value!

bool test\_and\_set (bool \*flag) {
 bool old = \*flag;
 \*flag = true;
 return old;
} // all atomic!

#### **Test-and-Set Locks**

- Assume TASval indicates "locked"
- Write something else to indicate "unlocked"
- TAS until return value is != TASval
- When will the lock be granted?
- Does this work well in practice?

volatile int lck = 0;

void lock() {
 while (TestAndSet(&lck) == 1);
}

void unlock() {
 lck = 0;
}

#### **Contention**

- On x86, the XCHG instruction is used to implement TAS
  - For experts: x86 LOCK is superfluous!

movl \$1, %eax
xchg %eax, (%ebx)

- Cacheline is read and written
  - Ends up in exclusive state, invalidates other copies
  - Cacheline is "thrown" around uselessly
  - High load on memory subsystem
- x86 bus lock is essentially a full memory barrier

#### **Test-and-Test-and-Set (TATAS) Locks**

- Spinning in TAS is not a good idea
- Spin on cache line in shared state
  - All threads at the same time, no cache coherency/memory traffic
- Danger!
  - Efficient but use with great care!
  - Generalizations are dangerous

volatile int lck = 0;

void lock() {
 do {
  while (lck == 1);
 } while (TestAndSet(&lck) == 1);
}

void unlock() {
 lck = 0;
}
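The TATAS lock above can be sketched in portable C++, assuming `std::atomic<bool>::exchange` as the test-and-set primitive (on x86 compilers typically emit XCHG for it). The names `TatasLock` and `tatas_count`, the explicit acquire/release orderings, and the `yield()` call are choices made for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class TatasLock {
    std::atomic<bool> lck{false};
public:
    void lock() {
        do {
            // Test: spin on plain loads so the cache line can stay in shared state.
            while (lck.load(std::memory_order_relaxed)) std::this_thread::yield();
            // Test-and-set: a single atomic exchange once the lock looks free.
        } while (lck.exchange(true, std::memory_order_acquire));
    }
    void unlock() { lck.store(false, std::memory_order_release); }
};

// Hypothetical driver: p threads increment a plain counter n times each.
long tatas_count(int p, int n) {
    assert(p > 0 && n >= 0);
    TatasLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&lock, &counter, n] {
            for (int i = 0; i < n; ++i) {
                lock.lock();
                ++counter;             // critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Dropping the inner `while` loop turns this into the plain TAS lock, where every spin iteration writes the cache line.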



#### **Array Queue Lock**

- Array to implement queue
  - Tail-pointer shows next free queue position
  - Each thread spins on own location
    - CL padding!
  - index[] array can be put in TLS
- So are we done now?
  - What's wrong?
  - Synchronizing M objects requires $\Theta(NM)$ storage
  - What do we do now?

volatile int array[n] = {1,0,...,0};
volatile int index[n] = {0,0,...,0};
volatile int tail = 0;

void lock() {
 index[tid] = GetAndInc(tail) % n;
 while (!array[index[tid]]); // wait to receive lock
}

void unlock() {
 array[index[tid]] = 0; // I release my lock
 array[(index[tid] + 1) % n] = 1; // next one
}

#### **CLH Lock (1993)**

- List-based (same queue principle)
- Discovered twice by Craig, Landin, Hagersten 1993/94
- 2N+3M words
  - N threads, M locks
- Requires thread-local qnode pointer
  - Can be hidden!
- Qnode objects represent thread state!
  - succ\_blocked == 1 if waiting or acquired lock
  - succ\_blocked == 0 if released lock
- List is implicit!
  - One node per thread
  - Spin location changes
    - NUMA issues (cacheless)
- Can we do better?

typedef struct qnode {
 struct qnode \*prev;
 int succ\_blocked;
} qnode;

qnode \*lck = new qnode; // node owned by lock

void lock(qnode \*lck, qnode \*qn) {
 qn->succ\_blocked = 1;
 qn->prev = FetchAndSet(lck, qn);
 while (qn->prev->succ\_blocked);
}

void unlock(qnode \*\*qn) {
 qnode \*pred = (\*qn)->prev;
 (\*qn)->succ\_blocked = 0;
 \*qn = pred;
}

#### **MCS Lock (1991)**

- Make queue explicit
  - Acquire lock by appending to queue
  - Spin on own node until locked is reset
- Similar advantages as CLH but
  - Only 2N + M words
  - Spinning position is fixed!
    - Benefits cache-less NUMA
- What are the issues?
  - Releasing lock spins
  - More atomics!

typedef struct qnode {
 struct qnode \*next;
 int locked;
} qnode;

qnode \*lck = NULL;

void lock(qnode \*lck, qnode \*qn) {
 qn->next = NULL;
 qnode \*pred = FetchAndSet(lck, qn);
 if(pred != NULL) {
  qn->locked = 1;
  pred->next = qn;
  while(qn->locked);
 }
}

void unlock(qnode \*lck, qnode \*qn) {
 if(qn->next == NULL) { // if we're the last waiter
  if(CAS(lck, qn, NULL)) return;
  while(qn->next == NULL); // wait for pred arrival
 }
 qn->next->locked = 0; // free next waiter
 qn->next = NULL;
}

#### **Lessons Learned!**

- Key Lesson:
  - Reducing memory (coherency) traffic is most important!
  - Not always straight-forward (need to reason about CL states)
- MCS: 2006 Dijkstra Prize in distributed computing
  - "an outstanding paper on the principles of distributed computing, whose significance and impact on the theory and/or practice of distributed computing has been evident for at least a decade"
  - "probably the most influential practical mutual exclusion algorithm ever"
  - "vastly superior to all previous mutual exclusion algorithms"
  - fast, fair, scalable → widely used, always compared against!

#### **Time to Declare Victory?**

- Down to memory complexity of 2N+M
  - Probably close to optimal
- Only local spinning
  - Several variants with low expected contention
- But: we assumed sequential consistency
  - Reality causes trouble sometimes
  - Sprinkling memory fences may harm performance
  - Open research on minimally-synching algorithms!
  - Come and talk to me if you're interested
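A sketch of the MCS lock in C++, assuming sequentially consistent `std::atomic` fields so that the sequential-consistency assumption of the slides holds. `tail.exchange` stands in for FetchAndSet and `compare_exchange_strong` for CAS. The names `QNode`, `McsLock`, and `mcs_count` and the `yield()` calls are additions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// One queue node per thread; each thread spins only on its own node.
struct QNode {
    std::atomic<QNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
    std::atomic<QNode*> tail{nullptr};            // plays the role of lck
public:
    void lock(QNode* qn) {
        qn->next = nullptr;
        QNode* pred = tail.exchange(qn);          // append own node to the queue
        if (pred != nullptr) {
            qn->locked = true;
            pred->next = qn;                      // make the queue link explicit
            while (qn->locked) std::this_thread::yield();   // spin on own node
        }
    }
    void unlock(QNode* qn) {
        QNode* succ = qn->next;
        if (succ == nullptr) {                    // no successor visible yet
            QNode* expected = qn;
            if (tail.compare_exchange_strong(expected, nullptr)) return;  // queue is empty
            while ((succ = qn->next) == nullptr) std::this_thread::yield(); // successor is linking in
        }
        succ->locked = false;                     // free next waiter
    }
};

// Hypothetical driver: p threads increment a plain counter n times each.
long mcs_count(int p, int n) {
    assert(p > 0 && n >= 0);
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < p; ++t) {
        threads.emplace_back([&lock, &counter, n] {
            QNode qn;                             // thread-local queue node
            for (int i = 0; i < n; ++i) {
                lock.lock(&qn);
                ++counter;                        // critical section
                lock.unlock(&qn);
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

The second spin loop in `unlock()` is the "releasing lock spins" issue named above: a releasing thread may have to wait until its successor has finished linking itself into the queue.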