# **Design of Parallel and High-Performance Computing**

Fall 2016

Lecture: Introduction

Instructors: Torsten Hoefler & Markus Püschel

TA: Salvatore Di Girolamo

### ETH

Eidgenössische Technische Hochschule Zürich

### Goals of this lecture

- Motivate you!
- What is parallel computing?
  - And why do we need it?
- What is high-performance computing?
  - What's a Supercomputer and why do we care?
- Basic overview of
  - Programming models
    - Some examples
  - Architectures
    - Some case-studies
- Provide context for coming lectures


### Let us assume ...

... you were to build a machine like this ...



... we know how each part works

- There are just many of them!
- Question: How many calculations per second are needed to emulate a brain?







### **Human Brain - No Problem!**

... not so fast, we need to understand how to program those machines ...

Simulating 1 second of human brain activity takes 82,944 processors



Scooped!


[...] entirely true anymore. Researchers at the Okinawa Institute of Technology Graduate University in Japan and Forschungszentrum Jülich in Germany have [...] brain activity in a very, very powerful computer.

### Other problem areas: Scientific Computing

- Most natural sciences are simulation driven or are moving towards simulation
  - Theoretical physics (solving the Schrödinger equation, QCD)
  - Biology (Gene sequencing)
  - Chemistry (Material science)
  - Astronomy (Colliding black holes)
  - Medicine (Protein folding for drug discovery)
  - Meteorology (Storm/Tornado prediction)
  - Geology (Oil reservoir management, oil exploration)
  - and many more ... (even Pringles uses HPC)



### Other problem areas: Commercial Computing

- Databases, data mining, search
  - Amazon, Facebook, Google
- Transaction processing
  - Visa, Mastercard
- **Decision support** 
  - Stock markets, Wall Street, Military applications
- Parallelism in high-end systems and back-ends
  - Often throughput-oriented
  - Used equipment varies from COTS (Google) to high-end redundant mainframes (banks)

### Other problem areas: Industrial Computing

- Aeronautics (airflow, engine, structural mechanics, electromagnetism)
- Automotive (crash, combustion, airflow)
- Computer-aided design (CAD)
- Pharmaceuticals (molecular modeling, protein folding, drug design)
- Petroleum (Reservoir analysis)
- Visualization (all of the above, movies, 3d)

### What can faster computers do for us?

- Solving bigger problems than we could solve before!
  - E.g., Gene sequencing and search, simulation of whole cells, mathematics of the brain, ...
  - The size of the problem grows with the machine power → Weak Scaling
- Solve today's problems faster!
  - E.g., large (combinatorial) searches, mechanical simulations (aircraft, cars, weapons, ...)
  - The machine power grows with constant problem size → Strong Scaling
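
The two regimes are usually quantified as follows. This is a sketch in assumed notation that does not appear on the slides: $T(n,p)$ is the run time for problem size $n$ on $p$ processing elements, $S$ is speedup and $E$ is parallel efficiency.

$$S_{\text{strong}}(p) = \frac{T(n,1)}{T(n,p)}, \qquad E_{\text{strong}}(p) = \frac{S_{\text{strong}}(p)}{p} \qquad \text{(problem size } n \text{ fixed)}$$

$$E_{\text{weak}}(p) = \frac{T(n,1)}{T(p \cdot n,\, p)} \qquad \text{(problem size grows proportionally with } p\text{)}$$

Ideal strong scaling gives $S_{\text{strong}}(p) = p$; ideal weak scaling keeps the run time constant, so $E_{\text{weak}}(p) = 1$. The weak-scaling form assumes that the work grows linearly with $n$.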

### **High-Performance Computing (HPC)**

- a.k.a. "Supercomputing"
- Question: define "Supercomputer"!

  - "A supercomputer is a computer at the frontline of contemporary processing capacity--particularly speed of calculation." (Wikipedia)
  - Usually quite expensive (\$s and kWh) and big (space)
- HPC is a quickly growing niche market
  - Not all "supercomputers", wide base
  - Important enough for vendors to specialize
  - Very important in research settings (up to 40% of university spending)
    - "Goodyear Puts the Rubber to the Road with High Performance Computing"
    - "High Performance Computing Helps Create New Treatment For Stroke Victims"
    - "Procter & Gamble: Supercomputers and the Secret Life of Coffee"
    - "Motorola: Driving the Cellular Revolution With the Help of High Performance Computing"
    - "Microsoft: Delivering High Performance Computing to the Masses"


# The Top500 List

- A benchmark, solve Ax=b
  - As fast as possible! → as big as possible :-)
  - Reflects **some** applications, not all, not even many
  - Very good historic data!
- Speed comparison for computing centers, states, countries, nations, continents :-(
  - Politicized (sometimes good, sometimes bad)
  - Yet, fun to watch




March 19, 2013

# Swiss 'GPU Supercomputer' Will Be Fastest in Europe

Tiffany Trader


The NVIDIA GPU Technology Conference is in full-swing today in San Jose, Calif. The annual event kicked off this morning with a keynote from NVIDIA CEO Jen-Hsun Huang, who revealed that the Swiss National Supercomputing Center (CSCS) is building Europe's fastest GPU-accelerated supercomputer, an extension of a Cray system that was announced last year.

As Cray Vice President, Storage & Data Management Barry Bolding told *HPCwire*, this will be the first Cray supercomputer equipped with Intel Xeon processors and NVIDIA GPUs.



CSCS is part of ETH Zurich, one of the top universities in the world and the alma mater of Albert Einstein. The supercomputing center installed phase one of its shiny new Cray XC30 back in December 2012.




### **Blue Waters in 2012**



### **History and Trends**

[Figure: performance development since 1993 on a logarithmic axis from 100 Mflop/s to 162 PFlop/s. Labeled values: 59.7 GFlop/s and 162 PFlop/s. Reference marks: Single GPU/MIC Card, My Laptop (70 Gflop/s), My iPad2 & iPhone 4s (1.02 Gflop/s).]

# **High-Performance Computing grows quickly**

- Computers are used to automate many tasks
- Still growing exponentially
  - New uses discovered continuously

IDC, 2007: "The overall HPC server market grew by 15.5 percent in 2007 to reach \$11.6 billion [...] while the same kinds of boxes that go into HPC machinery but are used for general purpose computing, rose by only 3.6 per

IDC, 2009: "expects the HPC technical server market to grow at a healthy 7% to 8% yearly rate to reach revenues of \$13.4 billion by 2015."

"The non-HPC portion of the server market was actually down 20.5 per cent, to \$34.6bn"



How to increase the compute power?










### So how to invest the transistors?

- Architectural innovations
  - Branch prediction, Tomasulo logic/rename register, speculative execution, ...
  - Help only so much :-(
- What else?
  - Simplification is beneficial, fewer transistors per CPU, more CPUs, e.g., Cell B.E., GPUs, MIC
  - We call these "cores" these days
  - Also, more intelligent devices or higher bandwidths (e.g., DMA controller, intelligent NICs)








### Towards the age of massive parallelism

- Everything goes parallel
  - Desktop computers get more cores: 2, 4, 8, soon dozens, hundreds?
  - Supercomputers get more PEs (cores, nodes)
    - > 3 million today
    - > 50 million on the horizon
    - > 1 billion in a couple of years (after 2020)
- Parallel Computing is inevitable!

### Parallel vs. Concurrent computing

- Concurrent activities may be executed in parallel
- Example:
  - A1 starts at T1, ends at T2; A2 starts at T3, ends at T4
  - Intervals (T1,T2) and (T3,T4) may overlap!
- Parallel activities:
  - A1 is executed **while** A2 is running
  - Usually requires separate resources!


### **Granularity and Resources**

### **Activities**

- Micro-code instruction
- Machine-code instruction (complex or simple)
- Sequence of machine-code instructions:
  - Blocks
  - Loops
  - Loop nests
  - Functions
  - Function sequences

### **Parallel Resource**

- Instruction-level parallelism
  - Pipelining
  - VLIW
  - Superscalar
- SIMD operations
  - Vector operations
- Instruction sequences
  - Multiprocessors
  - Multicores
  - Multithreading


# **Resources and Programming**

### **Parallel Resource**

- Instruction-level parallelism
  - Pipelining
  - VLIW
  - Superscalar
- SIMD operations
  - Vector operations
- Instruction sequences
  - Multiprocessors
  - Multicores
  - Multithreading

### **Programming**

- Compiler
  - (inline assembly)
  - Hardware scheduling
- Compiler (inline assembly)
- Libraries
- Compilers (very limited)
- Expert programmers
  - Parallel languages
  - Parallel libraries
  - Hints


# **Historic Architecture Examples**

- Systolic Array
  - Data-stream driven (data counters)
  - Multiple streams for parallelism
  - Specialized for applications (reconfigurable)
- Dataflow Architectures
  - No program counter, execute instructions when all input arguments are available
  - Fine-grained, high overheads
  - Example: compute f = (a+b) \* (c+d)




### **Von Neumann Architecture**

- Program counter → Inherently serial!
- Retrospectively define parallelism in instructions and data

|                          | Single Data                                     | Multiple Data                                     |
|--------------------------|-------------------------------------------------|---------------------------------------------------|
| **Single Instruction**   | SISD: Standard Serial Computer (nearly extinct) | SIMD: Vector Machines or Extensions (very common) |
| **Multiple Instruction** | MISD: Redundant Execution (fault tolerance)     | MIMD: Multicore (ubiquitous)                      |

**Parallel Architectures 101** 





... and mixtures of those





## **Programming Models**

- Shared Memory Programming (SM/UMA)
  - Shared address space
  - Implicit communication
  - Hardware for cache-coherent remote memory access
  - Cache-coherent Non-Uniform Memory Access (cc-NUMA)
- (Partitioned) Global Address Space (PGAS)
  - Remote Memory Access
  - Remote vs. local memory (cf. ncc-NUMA)
- **Distributed Memory Programming (DM)** 
  - Explicit communication (typically messages)
  - Message Passing





### **Shared Memory Machines**

### Two historical architectures:

- "Mainframe" – all-to-all connection between memory, I/O and PEs
  - Often used if PE is the most expensive part
  - Bandwidth scales with P
  - PE Cost scales with P, Question: what about network cost?




  - Answer: Cost can be cut with multistage connections (butterfly)
- "Minicomputer" – bus-based connection
  - All traditional SMP systems
  - High latency, low bandwidth (cache is important)
  - Tricky to achieve highest performance (contention)
  - Low cost, extensible



# **Shared Memory Machine Abstractions**

- Any PE can access all memory
  - Any I/O can access all memory (maybe limited)
- OS (resource management) can run on any PE
  - Can run multiple threads in shared memory
  - In use for 40+ years

### Communication through shared memory

- Load/store commands to memory controller
- Communication is implicit
- Requires coordination
- Coordination through shared memory
  - Complex topic
  - Memory models




## **Shared Memory Machine Programming**

- Threads or processes
  - Communication through memory
- Synchronization through memory or OS objects
  - Lock/mutex (protect critical region)
  - Semaphore (generalization of mutex (binary sem.))
  - Barrier (synchronize a group of activities)
  - Atomic Operations (CAS, Fetch-and-add)
  - Transactional Memory (execute regions atomically)
- Practical Models:
  - POSIX threads
  - MPI-3
  - OpenMP
  - Others: Java Threads, Qthreads, ...




# An SMM Example: Compute Pi

Using the Gregory-Leibniz series:

$$\pi = 4\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$$

- Iterations of sum can be computed in parallel
- Needs to sum all contributions at the end




# **Pthreads Compute Pi Example**



### Additional comments on SMM

- OpenMP would make this example much simpler to implement (but has other issues)
- Transparent shared memory has some issues in practice:
  - False sharing (e.g., resultarr[])
  - Race conditions (complex mutual exclusion protocols)
  - Little tool support (debuggers need some work)
- Achieving performance is harder than it seems!

### **Distributed Memory Machine Programming**

- Explicit communication between PEs
  - Message passing or channels
- Only local memory access, no direct access to remote memory
  - No shared resources (well, the network)




- Programming model: Message Passing (MPI, PVM)
  - Communication through messages or group operations (broadcast, reduce, etc.)
  - Synchronization through messages (sometimes unwanted side effect) or group operations (barrier)
  - Typically supports message matching and communication contexts

**DMM Example: Message Passing** 





- Send specifies buffer to be transmitted
- Recv specifies buffer to receive into
- Implies copy operation between named PEs
- Optional tag matching
- Pair-wise synchronization (cf. happens before)




# **DMM Example: PGAS**

- Partitioned Global Address Space
  - Shared memory emulation for DMM
    - Usually non-coherent
  - "Distributed Shared Memory"
    - Usually coherent



- Has problems similar to SMM programming
- Sometimes lacks performance transparency
  - Local vs. remote accesses
- Examples:
  - UPC, CAF, Titanium, X10, ...



### **How to Tame the Beast?**

- How to program large machines?
- No single approach, PMs are not converging yet
  - MPI, PGAS, OpenMP, Hybrid (MPI+OpenMP, MPI+MPI, MPI+PGAS?), ...
- Architectures converge
  - General purpose nodes connected by general purpose or specialized networks
  - Small scale often uses commodity networks
  - Specialized networks become necessary at scale
- Even worse: accelerators (not covered in this class, yet)







### **Practical SMM Programming: Pthreads**

Covered in example, small set of functions for thread creation and management







# **Practical PGAS Programming: UPC**

PGAS extension to the C99 language

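|         | Thread 0        | Thread 1        | Thread 2        | Thread 3        |
|---------|-----------------|-----------------|-----------------|-----------------|
| Shared  | c[0], c[4], ... | c[1], c[5], ... | c[2], c[6], ... | c[3], c[7], ... |
| Private | (per thread)    | (per thread)    | (per thread)    | (per thread)    |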

- Many helper library functions
  - Collective and remote allocation
  - Collective operations
- Complex consistency model



# **Complete Six Function MPI-1 Example**





- Support for shared memory in SMM domains
- Support for Remote Memory Access Programming
  - Direct use of RDMA
  - Essentially PGAS
- Scalable topologies
- More nonblocking features
- ... many more









Advanced MPI, including MPI-3



# **Accelerator example: CUDA**

```
#define N 10

// The Kernel: executed on the GPU, one block per array element
__global__ void add( int *a, int *b, int *c ) {
  int tid = blockIdx.x;  // handle the data at this index
  if (tid < N)
    c[tid] = a[tid] + b[tid];
}

// Host Code
int main( void ) {
  int a[N], b[N], c[N];
  int *dev_a, *dev_b, *dev_c;

  // allocate the memory on the GPU
  cudaMalloc( (void**)&dev_a, N * sizeof(int) );
  cudaMalloc( (void**)&dev_b, N * sizeof(int) );
  cudaMalloc( (void**)&dev_c, N * sizeof(int) );

  // fill the arrays 'a' and 'b' on the CPU
  for (int i=0; i<N; i++) { a[i] = -i; b[i] = i * i; }

  // copy the arrays 'a' and 'b' to the GPU
  cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
  cudaMemcpy( dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

  // launch N blocks with one thread each
  add<<<N,1>>>( dev_a, dev_b, dev_c );

  // copy the array 'c' back from the GPU to the CPU
  cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost );

  // free the memory allocated on the GPU
  cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
  return 0;
}
```

# OpenACC / OpenMP 4.0

- Aims to simplify GPU programming
- **Compiler support** 
  - Annotations!

Example:

    #define N 10
    int main( void ) {
      int a[N], b[N], c[N];
      #pragma acc kernels
      for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];
    }

# More programming models/frameworks

- Not covered:
  - SMM: Intel Cilk / Cilk Plus, Intel TBB, ...
  - Directives: OpenHMPP, PVM, ...
  - PGAS: Coarray Fortran (Fortran 2008), ...
  - HPCS: IBM X10, Fortress, Chapel, ...
  - Accelerator: OpenCL, C++AMP, ...
- This class will not describe any model in more detail!
  - There are too many and they will change quickly (only MPI made it >15 yrs)
- No consensus, but fundamental questions remain:
  - Data movement
  - Synchronization
  - Memory Models
  - Algorithmics
  - Foundations




### **Computer Architecture vs. Physics**

- Physics (technological constraints)
  - Cost of data movement
  - Capacity of DRAM cells
  - Clock frequencies (constrained by end of Dennard scaling)
  - Speed of Light
  - Melting point of silicon
- Computer Architecture (design of the machine)
  - Power management
  - ISA / Multithreading
  - SIMD widths

"Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints." – Fred Brooks (IBM, 1962)

Have converted many former "power" problems into "cost" problems






- Cubic power improvement with lower clock rate due to V<sup>2</sup>F
- Slower clock rates enable use of simpler cores
- Simpler cores use less area (lower leakage) and reduce cost
- Tailor design to application to REDUCE WASTE
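
The "cubic" claim follows from the standard first-order model of dynamic power. As a sketch, with $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency (notation assumed here), and assuming the voltage can be lowered roughly in proportion to the frequency:

$$P_{\text{dyn}} \approx C\,V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^3$$

Under these assumptions, halving the clock rate cuts dynamic power to about $1/8$. Two cores at half the clock rate then match the nominal throughput of one full-speed core at about $2 \cdot \tfrac{1}{8} = \tfrac{1}{4}$ of the power, provided the work parallelizes perfectly and leakage is ignored.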


**Low-Power Design Principles (2005)** 



- Power5 (server)
  - 120W@1900MHz
  - Baseline
- Intel Core2 sc (laptop) :
  - 15W@1000MHz
  - 4x more FLOPs/watt than baseline
- Intel Atom (handhelds)
  - 0.625W@800MHz
  - 80x more
- GPU Core or XTensa/Embedded
  - 0.09W@600MHz
  - 400x more (80x-120x sustained)


Even if each simple core is 1/4th as computationally efficient as complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient.

Credit: John Shalf (LBNL)

# Heterogeneous Future (LOCs and TOCs)



- Latency Optimized Core (LOC)
  - Most energy efficient if you don't have lots of parallelism
- Throughput Optimized Core (TOC)
  - Most energy efficient if you DO have a lot of parallelism!

Credit: John Shalf (LBNL)

















# **POWER7 Core**

- Execution Units
  - 2 Fixed point units
  - 2 Load store units
  - 4 Double precision floating point
  - 1 Branch
  - 1 Condition register
  - 1 Vector unit
  - 1 Decimal floating point unit
  - 6 wide dispatch
- Recovery Function Distributed
- 1,2,4 Way SMT Support
- Out of Order Execution
- 32KB I-Cache
- 32KB D-Cache
- 256KB L2
  - Tightly coupled to core

Source: IBM/NCSA



### **POWER7 Chip (8 cores)**

### Base Technology

- 45 nm, 576 mm<sup>2</sup>
- 1.2 B transistors

### Chip

- 8 cores
- 4 FMAs/cycle/core
- 32 MB L3 (private/shared)
- Dual DDR3 memory
  - 128 GiB/s peak bandwidth (1/2 byte/flop)
- Clock range of 3.5 to 4 GHz

Source: IBM/NCSA



# **Quad Chip Module (4 chips)**

- 32 cores
  - 32 cores \* 8 F/cycle/core \* 4 GHz = 1 TF
- 4 threads per core (max)
  - 128 threads per package
- 4x32 MiB L3 cache
- 512 GB/s RAM BW (0.5 B/F)
- 800 W (0.8 W/GF)
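
The headline numbers can be checked with a short calculation. As a sketch, it takes the 8 flops per cycle per core as the chip's 4 FMAs/cycle/core with each fused multiply-add counted as two flops, and uses the upper clock rate of 4 GHz:

$$32\ \text{cores} \times 8\ \tfrac{\text{flop}}{\text{cycle}\cdot\text{core}} \times 4\times10^{9}\ \tfrac{\text{cycle}}{\text{s}} = 1.024\times10^{12}\ \tfrac{\text{flop}}{\text{s}} \approx 1\ \text{TF}$$

$$\frac{512\ \text{GB/s}}{1024\ \text{GF}} = 0.5\ \text{B/F}, \qquad \frac{800\ \text{W}}{1024\ \text{GF}} \approx 0.8\ \text{W/GF}$$

The bandwidth figure matches four chips at 128 GiB/s each, so the module keeps the single chip's ratio of half a byte of memory traffic per flop.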

Source: IBM/NCSA



# Adding a Network Interface (Hub)

- Connects QCM to PCI-e
  - Two 16x and one 8x PCI-e slot
- Connects 8 QCMs via low latency, high bandwidth, copper fabric
  - Provides a message passing mechanism with very high bandwidth
  - Provides the lowest possible latency between 8 QCMs

Source: IBM/NCSA














### **DPHPC Lecture**

- You will most likely not have access to the largest machines
  - But our desktop/laptop will be a "large machine" soon
  - HPC is often seen as "Formula 1" of computing (architecture experiments)
- DPHPC will teach you concepts!
  - Enable you to understand and use all parallel architectures
  - From a quad-core mobile phone to the largest machine on the planet!
    - MCAPI vs. MPI: same concepts, different syntax
  - No particular language (but you should pick/learn one for your project!)
- Parallelism is the future




### Related classes in the SE focus

- 263-2300-00L How to Write Fast Numerical Code
  - http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-ETHspring16/course.html
  - Spring 2017
  - Lecturer: Prof. M. Pueschel

This list is not exhaustive!

DPHPC Overview


