#### **ETH**zürich

J. DE FINE LICHT AND T. HOEFLER

# Productive Parallel Programming on FPGA with High-Level Synthesis



#### **Based on material from:**

Transformations of High-Level Synthesis Codes for High-Performance Computing https://arxiv.org/abs/1805.08288

### **Code examples found at:**

https://github.com/spcl/hls\_tutorial\_examples

#### **Virtual machine for emulation**

http://spcl.inf.ethz.ch/~definelj/HLS\_Tutorial.7z

#### **Nimbix Alveo Trial**

https://www.nimbix.net/alveotrial



#### Our goal



and the second s



#### A single operation



A PART PART CONTRACTOR

4



#### **Register transfer level**

```
always @(posedge clk)
if (start) begin
    out <= in + 1;
end</pre>
```



COALCOLLESS CONTRACTOR

$$int c = a + b;$$



#### Single <u>floating point</u> operation



Contraction of the second second second second



#### Our point of view

# In HLS, we treat **pipelines**.





#### **Pipelines**



Contraction of the second



#### **Multiple floating point operations**



Caller & and and the set



#### **Initiation interval**



The second second



#### Pipelines vol. II





#### **Adding loops**

for (int i = 0; i < N; ++i) {
 #pragma HLS PIPELINE II=1
 c[i] = (a[i] + b[i]) \*
 (a[i] - b[i]);
}</pre>

| 1 iteration   | 13 + 1 = 14 cycles  |
|---------------|---------------------|
| 10 iterations | 13 + 10 = 23 cycles |
| N iterations  | 13 + N cycles       |



A REAL PROPERTY OF THE PARTY OF

Loop iterations affect the runtime **additively**, regardless of body content



#### **Adding loops**



The state of the second second second second

#### **Pipeline stalls in practice**

Why do we worry so much? Just set I = 1...

Increased II is commonly found in the wild:

- 1. Intra-iteration (now):
  - Multiple accesses to the same interface
- 2. Inter-iteration (later)
  - Data dependencies
- 3. Low throughput requirements (rare)
  - e.g. input only received every 16 cycles

```
for (int i = 1; i < N - 1; ++i) {
    #pragma HLS PIPELINE ||=1 ||=3
    res[i] = 0.3333 *
      (arr[i-1] + arr[i] + arr[i+1]);
}</pre>
```

Contraction and the second





#### Fast memory everywhere



and the second se



#### **Inserting registers**

for (int i = 1; i < N - 1; ++i) {
 res[i] = 0.3333 \*
 (arr[i-1] + arr[i] + arr[i+1]);
}</pre>

All and and and the second





#### ≥ Two classes of storage



and the second and

Useful to think of memory in terms of **depth** (*D*) and **width** (*W*)



#### **Buffer depth**

**Memory arrangement** 

all the sectors the

1D stencil program: W = 2, D = 1

2D row-major:W = 2, D = N



#### **Modified example**



A State of the second sec



#### Thinking in width and depth



a distance of the second s



#### **Finite state machine**



A State of the sta



#### **Draining and initialization phases**





#### Flattening



A CONTRACTOR OF A CONTRACTOR



**Multiple concurrent pipelines** 

#### $(L_0 + ((I_0 + (I_0 +$





#### **Streams**



A CALL CONTRACTOR



\_

\_

#### **Processing elements**



and the second second



#### **Properties of the global pipeline**



What goes in must come out:

Every stream write needs a corresponding read



$$L_{\text{tot}} = I N + L_0 + L_1 + L_3 \approx I N$$

$$I, L_0 \qquad I, L_1 \qquad I, L_3$$

$$I, L_0 \qquad I, L_2$$

#### Depth is "free":

In a perfect pipeline for large *N*, the influence of pipeline latency is negligible w.r.t. the total runtime



#### Unrolling

Much stronger meaning on FPGA than for an instruction-based architecture.

```
for (int w = 0; w < 4; ++w) {
    #pragma HLS PIPELINE II=1
    res[w] = a[w] + b[w];
}</pre>
```



#pragmaghtas-PUS EUPBELINE II=1 **for** (ries[W]=@[W] ← &[0]+w) { #presma+Has]UNROLL res[eve][2]a[va][2]b[tv][2];res[3] = a[3] + b[3];In a pipelined section, every "instruction" is separate hardware

Martin Participante and





A STATE AND A STATE OF A STATE



#### **Matrix-matrix multiplication**



and the second and a



#### **Solving loop-carried data dependencies**





#### Locality in the program

```
for (int n = 0; n < N; ++n) {
  float acc[P];
                                                Temporal locality:
                                               Reused P times. Load
  for (int k = 0; k < K; ++k) {
                                              more of these and tile N!
    const auto a = A[n^{*}K + k];
    for (int m = 0; m < M; ++m) -
                                                Spatial locality: Vectorizable.
      #pragma HLS PIPELINE II=1
      const float prev = (k == 0) ? 0 : acc[m];
      acc[m] = prev + a * B[k*M + m];
  // ...
```

Contraction of the Party of







#### **Fanout issue**



to the second



#### **Breaking into processing elements**

A and B only accessed at head

**Constant** number of connections per PE!!

## **Implemented in** example 7

State and the second second

## Concepts enough to "fully" utilize an FPGA 🙂

Only local communication

remains!



Sometimes referred to as a **systolic array**.



#### **Remaining optimization potential**



Fully optimized implementation at: <u>https://github.com/spcl/gemm\_hls</u>

and the second server that



#### **Summary**



A REAL PROPERTY OF





## Thank you for your attention 😊

Reach out at: <u>definelicht@inf.ethz.ch</u>

For a detailed description of HLS transformations, see:

Transformations of High-Level Synthesis Codes for High-Performance Computing https://arxiv.org/abs/1805.08288



#### **Two flavors of systolic arrays**



A MARY PARTY CONTRACTOR



spcl.inf.ethz.ch

#### Approaching a new problem



CT2 MA

#### When HDL should be involved



The second second





#### FPGAs are more energy efficient than GPUs

a characteristic man



#### Claim

#### FPGAs are more energy efficient than GPUs

#### ompute throughput

At  $F_{\text{FPGA}} = F_{\text{GPU}}$ , we are looking at a **werdict improvement** in energy efficiency.

Only if the performance is competitive

For a 10× slowdown, we **lose** a factor 2× in energy efficiency

Might seem trivial... F matters!

(for now, accept these numbers for the sake of argument)



#### **Types on FPGA**



Same concepts, different latencies (some problems go away at L = 1).