



# FPGA or CGRA? Reconfigurable Architectures Suitable for High-Performance Computing

Kentaro Sano RIKEN Center for Computational Science (R-CCS)

## **Summary of this Talk**

• Reconfigurable data-flow computing as a promising architecture for HPC

#### • FPGA

- ESSPER: Elastic and scalable FPGA-cluster system for high-performance reconfigurable computing, a prototype FPGA cluster for HPC
- ✓ Lessons we learned

#### • CGRA (Coarse-grained reconfigurable array)

- ✓ RIKEN CGRA research
- ✓ What we have studied so far



## Introduction

#### • World ranking of supercomputers

- ✓ TOP500: Ranking of HPL performance
- ✓ CPU-based vs. GPU/Acc-based
- Performance improvement slowed down around 2015.
- System performance is limited by system power.
  - ✓ Reached tens of MW
    - (Fugaku: 30 MW, Frontier: 21 MW for HPL)
  - Not easy to increase further (100 MW is not realistic in view of SDGs and cost).



With a capped power budget, we need to increase performance per power.
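The arithmetic behind this statement is that sustained performance equals power budget times energy efficiency. The sketch below is only an illustration: the 30 MW cap echoes the Fugaku figure quoted above, while the efficiency values and function names are hypothetical and are not measurements from this talk.

```python
def performance_pflops(power_mw: float, gflops_per_watt: float) -> float:
    """Performance (PFLOPS) reachable at a given power cap and efficiency."""
    watts = power_mw * 1e6
    gflops = watts * gflops_per_watt
    return gflops / 1e6  # 1 PFLOPS = 1e6 GFLOPS


def required_efficiency(target_pflops: float, power_mw: float) -> float:
    """Efficiency (GFLOPS/W) needed to reach a target under a power cap."""
    return (target_pflops * 1e6) / (power_mw * 1e6)


POWER_CAP_MW = 30.0  # same order as the HPL power quoted above

# Hypothetical efficiencies, chosen only to show the linear relation.
for efficiency in (15.0, 50.0, 100.0):
    pflops = performance_pflops(POWER_CAP_MW, efficiency)
    print(f"{efficiency:6.1f} GFLOPS/W at {POWER_CAP_MW} MW -> {pflops:8.1f} PFLOPS")
```

With the power cap fixed, the only free variable left is the efficiency term, which is the point of the slide.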



## What Eats Power?

#### • Data movement rather than computing

 We should remove unnecessary data movement and shorten the movement that remains.

#### • Unsuitable architecture with low efficiency and scalability

- von Neumann architectures (CPU & GPU) cannot scale efficiently due to
  - memory-bottlenecked structures, such as register files and a NoC with LLC for multiple cores
  - extra mechanisms that consume power just to increase IPC, such as out-of-order execution and branch predictors.

#### Semiconductor scaling cannot save it.

 Power improvement per generation is limited, while transistors per area can still increase at advanced technology nodes such as 5, 3, 2, and 1.5 nm.

#### **Communication Dominates Arithmetic**





## What can Save Us?

### Data-flow computing architecture

- ✓ Localized data movement
- No memory bottleneck: distributed and pipelined ALUs with regular/simplified memory access
- No extra mechanisms for non-computing
- Circuit reconfigurability is key.
  - ✓ Giving programmability
  - Higher efficiency for target problems
- What are the candidate technologies for reconfigurable data-flow computing? FPGA and CGRA?
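One way to picture localized data movement is a chain of pipeline stages in which each value flows directly from one operator to the next, without a round trip through a shared memory. The sketch below models this with Python generators. The helper names and the example kernel `y = (a*x + b)^2` are illustrative assumptions, not something taken from the talk.

```python
from typing import Callable, Iterable, Iterator


def stage(op: Callable[[float], float], stream: Iterable[float]) -> Iterator[float]:
    """One pipeline stage: take a value, apply one operation, pass it on."""
    for value in stream:
        yield op(value)


def build_pipeline(stream: Iterable[float], a: float, b: float) -> Iterator[float]:
    """Chain three stages; no intermediate array is ever materialized."""
    scaled = stage(lambda x: a * x, stream)     # multiplier
    shifted = stage(lambda x: x + b, scaled)    # adder
    squared = stage(lambda x: x * x, shifted)   # multiplier
    return squared


result = list(build_pipeline([1.0, 2.0, 3.0], a=2.0, b=1.0))
print(result)  # [9.0, 25.0, 49.0]
```

Each stage only ever touches the value handed to it by its predecessor, which is the software analogue of operators wired to their neighbors.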







# **Goal and Roadmap of Processor Research Team**

### **Goal: Establish HPC architectures suitable in Post-Moore Era**



WRC2024: FPGA or CGRA

#### 1. Advancement of Fugaku

- Functional extension with FPGAs (FPGA cluster, ESSPER)
- SoC, system software, applications





# **Prototype FPGA Cluster for Supercomputer Fugaku**

### **Open-Access paper**





### **This Work**



#### Goal : Design & demonstrate a proof-of-concept FPGA cluster for HPC research

• **ESSPER** : Elastic and scalable FPGA-cluster system for high-performance reconfigurable computing

### Contributions

- ✓ Design concept of FPGA cluster for HPC
- Classification of FPGA cluster architectures
- Proposed system stack with software-bridged APIs
- ✓ Implementation and evaluation for FPGA-based extension of the world's top-class supercomputer, Fugaku








### FPGAs have yet to be Mainstream in HPC.

System architecture not matured yet.

Still have system-level challenges for FPGA-based HPC

**Productive customizability for computing HW** 

**Performance scalability** with multiple FPGAs








### **Challenges and Approaches for FPGA-based HPC**

#### **Productive customizability for computing HW**

- ✓ Able to implement various hardware (algorithms) on FPGA
- Approach: no OpenCL, so as not to limit computing models
- Approach: FPGA Shell & HLS/HDL programming, where any hardware can be easily implemented

#### **Performance scalability** with multiple FPGAs

- ✓ Inter-FPGA communication available
- ✓ Allow users to easily try multi-FPGA applications
- Approach: FPGA Shell supporting a high-bandwidth and low-latency network dedicated to FPGAs

#### **Interoperability** with existing HPC systems

- ✓ Able to easily extend existing systems with FPGAs
- ✓ Can we extend Supercomputer Fugaku?
- Approach: software-bridged APIs to access FPGAs remotely through a host-FPGA bridging network



# **Architecture of ESSPER**



#### Productive customizability

- No OpenCL (so as not to limit computing models)
- FPGA Shell & HLS/HDL programming, where any hardware can be easily implemented

### Performance scalability

FPGA Shell supporting high-bandwidth and low-latency network dedicated to FPGAs

### Interoperability

Software-bridged APIs to access FPGAs remotely through host-FPGA bridging network

### **Architecture Classification**

Legend: (S)M = (shared) memory, P = CPU, F = FPGA, NW = network

a. Cluster of CPUs with FPGAs (distributed or shared memory)




#### **Our architecture**

b. Cluster of CPUs and FPGAs

c. Clusters of CPUs with inter-connected FPGAs


# **Related Work : FPGAs Clusters in HPC/DC**

| FPGA NW type    | Direct network                                   | Indirect network                             | Indirect circuit-switching NW         |
|-----------------|--------------------------------------------------|----------------------------------------------|---------------------------------------|
| Characteristics | p2p connection without switches; typical: torus  | connection with switches; typical: Ethernet  | connection with optical switch (MEMS) |
| Switching       | circuit or packet (w/ router)                    | packet                                       | circuit or packet (w/ router)         |
| Pros            | low latency                                      | flexibility, small diameter                  | low latency, flexibility              |
| Cons            | inflexibility, large diameter                    | higher latency, complex                      | expensive, signal attenuation         |

Representative systems (shown as images in the slide): Cygnus @ U of Tsukuba; the pictured systems are labeled Archi-C, Archi-C, Archi-B, and Archi-C.
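The "large diameter" drawback listed for direct networks can be made concrete. In a torus, the worst-case hop count is the sum of `floor(k/2)` over the dimension sizes `k`, whereas nodes attached to a single switch are always two links apart. The sketch below checks the closed form against a breadth-first search. The torus sizes and the single-switch assumption are illustrative and do not describe any of the systems named above.

```python
from collections import deque


def torus_diameter(dims):
    """Closed form: worst-case hops in a torus is sum(floor(k/2)) over dimensions."""
    return sum(k // 2 for k in dims)


def torus_diameter_bfs(dims):
    """Same quantity measured by breadth-first search from node (0, ..., 0).

    A torus looks the same from every node, so one source is enough.
    """
    start = tuple(0 for _ in dims)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for axis, size in enumerate(dims):
            for step in (-1, 1):
                neighbor = list(node)
                neighbor[axis] = (neighbor[axis] + step) % size  # wrap-around link
                neighbor = tuple(neighbor)
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
    return max(dist.values())


SINGLE_SWITCH_LINKS = 2  # assumption: FPGA -> one switch -> FPGA

for dims in [(4, 4), (8, 8), (4, 4, 4)]:
    print(dims, "torus diameter:", torus_diameter(dims),
          "| via one switch:", SINGLE_SWITCH_LINKS)
```

The torus diameter grows with the machine size while the switched path stays constant, at the price of the switch's own latency and complexity noted in the table.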





Elastic and Scalable System for High-Performance Reconfigurable Computing

# **System Design**



# Hardware Organization of ESSPER



Jan 17, 2024

**FPGA Shell** 

### **FPGA Shells for Direct and Indirect Networks**

#### **Direct connection network (DCN)**

#### Indirect network (VCSN)

Labels in the shell block diagrams: CCIP, Avalon-MM, usr clk, AFU, 100G Ethernet, virtual circuit-switching NW, 220+ MHz



Tomohiro Ueno, Atsushi Koshiba, Kentaro Sano, "Virtual Circuit-Switching Network with Flexible Topology for High-Performance FPGA Cluster," Procs. of ASAP, pp.41-48, 2021.




# System Stack of ESSPER





### Remote-OPAE (for remote FPGA Access)

#### Software bridge for FPGAs over InfiniBand

 ✓ OPAE: Open Programmable Acceleration Engine (PCIe FPGA driver)






- ✓ 99% of OPAE APIs are supported.
- ✓ We can use any FPGAs in a system as if they were locally installed.

Labels in the software-stack diagram: R-OPAE daemon, DMA Library, DMA API, R-OPAE C API, OPAE C API / DMA API, OPAE library, R-OPAE library, Enumeration, Access, Management, Intel FPGA drivers, FME driver, PORT/AFU driver, FPGA PCIe driver, SGDMAs, Intel FPGA, IB EDR 100Gbps, via IB Verbs / RDMA, Another server, FPGA server (x86)
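The idea of a software bridge can be sketched as a thin client-side proxy that forwards each driver call to a daemon running next to the FPGA. The sketch below is purely conceptual. All class and method names (`FakeLocalDriver`, `BridgeDaemon`, `RemoteFpga`, `read_mmio64`, `write_mmio64`) are hypothetical and are not the OPAE or R-OPAE API, and an in-process function call stands in for the InfiniBand transport.

```python
import json


class FakeLocalDriver:
    """Stand-in for a local FPGA driver: a dict plays the role of MMIO registers."""

    def __init__(self):
        self._regs = {}

    def write_mmio64(self, offset: int, value: int) -> None:
        self._regs[offset] = value & 0xFFFFFFFFFFFFFFFF

    def read_mmio64(self, offset: int) -> int:
        return self._regs.get(offset, 0)


class BridgeDaemon:
    """Runs on the FPGA server: decodes a request and calls the local driver."""

    ALLOWED = {"write_mmio64", "read_mmio64"}

    def __init__(self, driver):
        self._driver = driver

    def handle(self, request: bytes) -> bytes:
        msg = json.loads(request)
        if msg["call"] not in self.ALLOWED:
            return json.dumps({"error": "unsupported call"}).encode()
        result = getattr(self._driver, msg["call"])(*msg["args"])
        return json.dumps({"result": result}).encode()


class RemoteFpga:
    """Client-side proxy exposing the same calls as the local driver."""

    def __init__(self, transport):
        self._send = transport  # callable: request bytes -> reply bytes

    def _call(self, name, *args):
        request = json.dumps({"call": name, "args": list(args)}).encode()
        reply = json.loads(self._send(request))
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["result"]

    def write_mmio64(self, offset: int, value: int) -> None:
        self._call("write_mmio64", offset, value)

    def read_mmio64(self, offset: int) -> int:
        return self._call("read_mmio64", offset)


daemon = BridgeDaemon(FakeLocalDriver())
fpga = RemoteFpga(daemon.handle)  # in-process call replaces the network here
fpga.write_mmio64(0x40, 0xDEADBEEF)
print(hex(fpga.read_mmio64(0x40)))  # 0xdeadbeef
```

Because the proxy mirrors the driver's interface, application code does not need to know whether the FPGA is local or remote, which is the transparency property the slide describes.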



### **R-OPAE** as Software-based Resource Disaggregation

#### **Transparent access to remote FPGAs**

### Flexible utilization:

- Can use any available FPGA resources

### Inter-operability and extensibility:

- ✓ Vendor/ISA-independent
- ✓ Operable with various architectures such as Fugaku (ARM)








# **Proof-of-concept and evaluation**



### Supercomputer Fugaku



48+ cores per node, 2.7+ TF












#### Elastic and Scalable System for High-Performance Re-Configurable Computing











# Applications, Joint Research Projects




# **Ringed FPGAs for Deeper Pipelining**

### • Deeply-pipelined FPGAs with 1D ring

- ✓ Linear array of Stratix10 FPGAs
- Pipelining yields almost linear speedup if the data stream is sufficiently large.
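Why the stream length matters can be seen from a first-order model. Assume each FPGA contributes one unit of work per stream element and adds a fixed fill latency (pipeline depth plus link latency). One pass over a stream of `N` elements through `n` FPGAs then takes about `N + n*L` cycles and performs `n` units of work per element. This is a simplified sketch under those assumptions, with hypothetical parameter values. It is not the performance model shown in the talk.

```python
def pass_cycles(stream_len: int, num_fpgas: int, latency_per_fpga: int) -> int:
    """Cycles for one pass: stream length plus the fill latency of every FPGA."""
    return stream_len + num_fpgas * latency_per_fpga


def speedup(stream_len: int, num_fpgas: int, latency_per_fpga: int) -> float:
    """Throughput of an n-FPGA ring relative to a single FPGA."""
    single = pass_cycles(stream_len, 1, latency_per_fpga)        # 1 unit of work
    ring = pass_cycles(stream_len, num_fpgas, latency_per_fpga)  # n units of work
    return num_fpgas * single / ring


LATENCY = 1_000  # hypothetical fill latency per FPGA, in cycles

for n_elements in (10**3, 10**5, 10**7):
    print(f"N={n_elements:>10,}: speedup with 8 FPGAs = "
          f"{speedup(n_elements, 8, LATENCY):.2f}")
```

In this model the speedup approaches `n` as `N` grows, and collapses toward 1 when the fill latency dominates the stream length.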





Block diagram of FPGAs in a ring

Performance model for Arria10 FPGAs



## Performance of 2D LBM with 100Gbps Ring NW

Computational performance (FLOPS) when processing about 2GB data





# Lessons Learned with ESSPER




- FPGA-based reconfigurable computing works.
- Productivity is not high, especially for multiple FPGAs.
  - Even HLS requires know-how on optimizing computation and memory access.
  - Debugging tools and simulation environments are lacking.
- Scalability can be obtained, but absolute floating-point (FP) performance is lower than that of competitors (GPUs).
  - ✓ FPGA-based system development takes time, while GPUs keep advancing.
  - For a fixed domain of computing (such as FP-based HPC and AI workloads), FPGAs carry redundancy: more area, more power, lower frequency, and low memory bandwidth.
- The concept of reconfigurable data-flow computing should be fine, but the implementation approach could be improved: CGRA instead of FPGA?





# **Exploration of New HPC Architectures**

Data-flow-based accelerators (CGRA)



### Coarse-Grained Reconfigurable Array (CGRA)

• Architecture for reconfigurable data-flow computing

- ✓ Composed of an array of processing elements (PEs), onto which data-flow graphs (DFGs) are mapped for computing
- ✓ Provides word-wise reconfigurability (e.g., 32-bit)
- ✓ Higher energy efficiency than FPGAs, which are reconfigurable at the bit level
- ✓ Performance close to ASIC-based accelerators
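To make "mapping a DFG onto PEs" more concrete, the sketch below represents a small data-flow graph as a dictionary and evaluates it in dependency order, with each node standing for one PE configured with a single word-level operation. The graph encoding, the node names, and the example kernel `(a*x + b)^2` are assumptions made for illustration. They do not reflect the configuration format of any actual CGRA.

```python
import operator
from graphlib import TopologicalSorter  # Python 3.9+

# Word-level operations a PE could be configured with (illustrative subset).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

# node -> (operation, input names); names not in the graph are external inputs.
DFG = {
    "t0": ("mul", ("a", "x")),
    "t1": ("add", ("t0", "b")),
    "t2": ("mul", ("t1", "t1")),
}


def evaluate(dfg, inputs):
    """Fire every node once, after the nodes it depends on have produced values."""
    deps = {node: [s for s in srcs if s in dfg] for node, (_, srcs) in dfg.items()}
    values = dict(inputs)
    for node in TopologicalSorter(deps).static_order():
        op_name, srcs = dfg[node]
        values[node] = OPS[op_name](*(values[s] for s in srcs))
    return values


out = evaluate(DFG, {"a": 2, "x": 3, "b": 1})
print(out["t2"])  # 49
```

On real hardware the nodes would be placed on distinct PEs and fire concurrently as operands arrive. The sequential loop here only reproduces the dependency order.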

### • Application area of CGRAs

- Traditionally targeted at low-power embedded apps, e.g., image processing
- ✓ Recently expected to serve high-performance AI
- Questions
  - ✓ Are CGRAs also promising for HPC?
  - ✓ What architecture/design decisions are required for HPC?









#### General structure of the CGRAs [2]


### **RIKEN CGRA Architecture (baseline)**

#### • HPC-oriented CGRA with the following design philosophy

- ✓ Modular design for design space exploration with various architecture configurations and sizes
- ✓ Isolation between computation in a PE array and memory access with load-and-store (LS) tiles
- ✓ Capability of floating-point operations for HPC apps





Past work: CGRA Designs with Embedded routers (ER) or Discrete routers (DR)



### **Design Decision on Intra-CGRA Interconnects**



| CGRA with Embedded Router (CGRA-ER) | CGRA with Discrete Router (CGRA-DR) |
|-------------------------------------|-------------------------------------|
| Routers in each PE mediate communication between PEs | Discrete routers, as switch blocks (SBs), handle communication between PEs |
| Simpler design with smaller area and higher frequency | More complex design with larger area and lower frequency |
| May waste computing resources: when mapping complex graphs, some PEs must be configured to bypass data without computation | More efficient, allowing more PEs to be used for computing: switch blocks improve routability and relieve PEs from compulsory data bypass |
| Ex) ADRES, CGRA-ME, HiPreP, and MorphoSys | Ex) HyCube, RAW (& the Tilera Processor), and Plasticine |

#### Which one is suitable for HPC considering computing efficiency and HW area?



### **Objective and Contributions**

• Objective: evaluate the positive and negative aspects of the two routing architectures, CGRA-ER and CGRA-DR, for HPC apps

- Contributions
  - ✓ Parameterized implementation of CGRA-ER and CGRA-DR using our baseline architecture
  - Evaluation and verification of benchmarks (DFGs) by RTL simulation with CGRA-evaluation framework (our previous work)
  - ✓ Comparison between CGRA-ER and CGRA-DR
    - Difficulty in place-and-route of DFGs
    - PE utilization for computation
    - Hardware resource consumption





### CGRA with Embedded Router (CGRA-ER)



- ✓ PEs and LSs are directly connected to each other.
- ✓ Limited routing capability causes inefficient PE utilization.
- Some PEs are wasted on routing: they are configured with NOP just to bypass data (no computing).
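A rough way to estimate this cost: if PEs can forward data only to their nearest mesh neighbors, a DFG edge between PEs at Manhattan distance `d` needs at least `d - 1` intermediate PEs acting as bypass. The sketch below counts these for a given placement. It assumes nearest-neighbor links, one bypass PE per hop, and no sharing of routes, and it ignores congestion. The placement and edges are made up for illustration.

```python
def bypass_pes(placement, edges):
    """Estimate PEs spent on NOP bypass: (Manhattan distance - 1) per DFG edge."""
    total = 0
    for src, dst in edges:
        (r0, c0), (r1, c1) = placement[src], placement[dst]
        distance = abs(r0 - r1) + abs(c0 - c1)
        total += max(distance - 1, 0)  # adjacent PEs need no bypass
    return total


def pe_usage(placement, edges, rows, cols):
    """Return (compute fraction, bypass fraction) of the PE array."""
    array_size = rows * cols
    return len(placement) / array_size, bypass_pes(placement, edges) / array_size


# Hypothetical placement of three DFG nodes on a 3x3 PE array: node -> (row, col).
placement = {"a": (0, 0), "b": (0, 1), "c": (2, 2)}
edges = [("a", "b"), ("a", "c"), ("b", "c")]

compute, bypass = pe_usage(placement, edges, rows=3, cols=3)
print(f"bypass PEs: {bypass_pes(placement, edges)}")
print(f"compute: {compute:.0%}, bypass: {bypass:.0%}")
```

In this toy case more PEs are spent moving data than computing, which is the utilization penalty the bullet refers to.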

#### **Embedded multiplexors**





### CGRA with Discrete Router (CGRA-DR)



- ✓ Switch blocks as discrete router
- ✓ Simpler PE and LS tile with no multiplexor
- ✓ More hardware resources are required for SBs

#### No embedded multiplexor





### **Routing Flexibility vs. Complex Kernel**

- The DFG of the Stockham radix-5 FFT kernel has more edges connecting its nodes.
  - It could not be mapped onto CGRA-ER; routing with PEs alone is not enough.
  - ✓ It can be mapped onto an $18 \times 16$ CGRA-DR and even a smaller one ($8 \times 16$).



DFG of the Innermost Loop of Stockham Radix 5 FFT kernel

Stockham Radix 5 FFT kernel mapped on CGRA-DR



## Future Work on CGRA for HPC

- Have not obtained conclusions yet.
- Extension of the baseline architecture for practical kernels
  - ✓ Predication for conditional execution



**Comparison with other architectures [1]** 

- Heterogeneous architecture with FP div, sqrt, log, and transcendental functions to cover a wider range of applications
- ✓ Programmable buffer for data reuse (such as a line/stencil buffer)
- Evaluation of operating frequency and hardware resource consumption with FPGA-based and/or ASIC implementation
  - ✓ Initial rough evaluation results for ASIC
  - ✓ FPGA implementation is also ongoing.
    - Supercomputer Fugaku CGRA emulation on ESSPER.
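For the line/stencil buffer mentioned in the list above, a minimal software sketch may help: a 3x3 stencil over a row-major pixel stream needs only the most recent `2*width + 3` values, so every input is read from memory once and reused from the buffer. The function name and the box-sum kernel are illustrative choices and not part of the RIKEN CGRA design.

```python
from collections import deque


def stencil3x3_sums(stream, width):
    """Yield the 3x3 window sum for each window that fits inside the image.

    The image arrives as a row-major stream with `width` pixels per row.
    Only the last 2*width + 3 values are kept, like a hardware line buffer.
    """
    window = deque(maxlen=2 * width + 3)
    for index, value in enumerate(stream):
        window.append(value)
        row, col = divmod(index, width)
        # The newest value is the bottom-right corner of the 3x3 window.
        if row >= 2 and col >= 2:
            buf = list(window)  # buf[0] is the top-left corner of the window
            yield sum(buf[r * width + c] for r in range(3) for c in range(3))


# 3x3 image holding 1..9: exactly one window, covering the whole image.
print(list(stencil3x3_sums(range(1, 10), width=3)))  # [45]

# 4x4 image of ones: four windows, each summing nine ones.
print(list(stencil3x3_sums([1] * 16, width=4)))      # [9, 9, 9, 9]
```

The buffer size depends on the row width rather than on the image height, which is why such buffers need to be programmable for different problem sizes.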



# Summary

Reconfigurable data-flow computing should be promising for power-efficient HPC.



#### Hiring researchers, Contact me!



#### ✓ FPGA-based HPC testbed: ESSPER (prototype FPGA cluster)

- Stratix 10 FPGAs
- FPGA Shell with inter-FPGA network

#### ✓ RIKEN CGRA for HPC

- Baseline architecture
- Design space exploration for inter-tile connection

#### Future work

- ✓ ESSPER2 with Intel Agilex-M FPGA
- ✓ CGRA for HPC and AI (design for ASIC and compiler)
- ✓ Feasibility study for next-gen supercomputers (conducted)