How to use a CGRA without even knowing about it - The AMIDAR Processors and their Successor

Speaker: Christian Hochberger - TU Darmstadt

Abstract: Coarse-grained reconfigurable architectures (CGRAs) have gained a lot of attention in recent years. Yet, integrating them into target systems and programming them still takes a major effort. In this talk, I will show how CGRAs can be used to accelerate general-purpose processors. I will start with a retrospective of our first system incorporating a CGRA, the AMIDAR processors. Then I will show how we integrate a CGRA into a RISC-V based system and dynamically map RISC-V assembly kernels to the CGRA. I will explain the challenges of this process, which of them we have already solved (and how), and which still remain open.
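
The dynamic mapping step can be pictured as lifting a hot RISC-V loop body into a dataflow graph (DFG) that a CGRA mapper can then place and route. The following is a minimal sketch of that lifting under simplifying assumptions (two-operand integer instructions, no memory ops); all names are illustrative and this is not the actual AMIDAR/RISC-V toolchain.

```cpp
// Sketch: turn a decoded RISC-V kernel into a DFG by tracking, for each
// register, which instruction last produced it. Illustrative only.
#include <map>
#include <string>
#include <vector>

struct Insn { std::string op; int rd, rs1, rs2; };   // decoded RISC-V instruction
struct DfgNode { std::string op; int lhs, rhs; };    // operation + producer node ids

std::vector<DfgNode> buildDfg(const std::vector<Insn>& kernel) {
    std::vector<DfgNode> dfg;
    std::map<int, int> lastDef;                      // register -> producing node id
    for (const Insn& i : kernel) {
        // A register read becomes an edge from its last producer; -1 marks a live-in.
        int lhs = lastDef.count(i.rs1) ? lastDef[i.rs1] : -1;
        int rhs = lastDef.count(i.rs2) ? lastDef[i.rs2] : -1;
        dfg.push_back({i.op, lhs, rhs});
        lastDef[i.rd] = static_cast<int>(dfg.size()) - 1;
    }
    return dfg;                                      // ready for CGRA placement & routing
}

int main() {
    // a[i] * b[i] + acc, as it might appear in a hot loop body
    std::vector<Insn> kernel = {{"mul", 7, 5, 6}, {"add", 8, 8, 7}};
    return static_cast<int>(buildDfg(kernel).size());
}
```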

Short bio

Christian Hochberger has been a full professor for Computer Systems in the EE&CE Department of TU Darmstadt since 2012. Before that, he was an associate professor for Embedded Systems in the computer science department of TU Dresden from 2003. He received his diploma and PhD in computer science from TU Darmstadt. Between his PhD and his appointment at TU Dresden, he worked for several years as a consultant on commercial embedded-systems and FPGA projects. His research focus is on reconfigurable technologies. His main goal is to make this technological progress available to non-specialists; thus, his group develops tools to program FPGAs and CGRAs efficiently and easily.

An efficient and flexible stochastic CGRA mapping approach

Speaker: Satyajit Das - Indian Institute of Technology Palakkad

Abstract: Coarse-Grained Reconfigurable Array (CGRA) architectures are promising high-performance and power-efficient platforms. However, mapping applications efficiently onto a CGRA is a challenging task, known to be an NP-complete problem. Hence, finding good mapping solutions for a given CGRA architecture within a reasonable time is hard. Additionally, achieving scalability in compilation time and memory footprint for large heterogeneous CGRAs is a well-known problem. In this work, we present a stochastic mapping approach that efficiently explores the architecture space and finds high-quality solutions while keeping memory usage limited and steady. Experimental results show that, thanks to better exploration of the mapping solution space, our compilation flow reaches performance with low-complexity CGRA architectures that is as good as that obtained with more complex ones. Parameters considered in our experiments include the number of tiles, Register File (RF) size, number of load/store (LS) units, and network topologies. Our results demonstrate that high-quality compilation for a wide range of applications is possible within reasonable run-times. Experiments with several DSP benchmarks show that the best CGRA configuration from the architectural exploration surpasses an ultra-low-power DSP-optimized RISC-V CPU, achieving up to 15.28X performance gain (6X on average, 3.4X minimum) and 29.7X energy gain (13.5X on average, 6.3X minimum) with only a 1.5X area overhead.
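
To make the flavor of a stochastic mapper concrete, here is a minimal simulated-annealing style sketch: DFG operations are placed on a tile grid, random re-placements are proposed, and worse placements are accepted with the Metropolis criterion. The cost model, move set, and cooling schedule are illustrative assumptions, not the authors' actual flow.

```cpp
// Toy stochastic CGRA mapper: minimize total Manhattan routing distance
// between dependent operations by annealing over tile assignments.
#include <cmath>
#include <cstdlib>
#include <vector>

struct Edge { int src, dst; };                       // dependency between DFG ops

int cost(const std::vector<int>& place, const std::vector<Edge>& edges, int cols) {
    int c = 0;
    for (const Edge& e : edges) {                    // Manhattan distance per edge
        int a = place[e.src], b = place[e.dst];
        c += std::abs(a / cols - b / cols) + std::abs(a % cols - b % cols);
    }
    return c;
}

std::vector<int> anneal(int nOps, int rows, int cols, const std::vector<Edge>& edges) {
    std::vector<int> place(nOps);
    for (int i = 0; i < nOps; ++i)
        place[i] = i % (rows * cols);                // naive initial placement (toy model
                                                     // allows ops to share a tile)
    int cur = cost(place, edges, cols);
    for (double T = 10.0; T > 0.01; T *= 0.995) {    // geometric cooling schedule
        int op = std::rand() % nOps;
        int old = place[op];
        place[op] = std::rand() % (rows * cols);     // propose a random move
        int nxt = cost(place, edges, cols);
        // accept improvements always, uphill moves with probability exp(-dC/T)
        if (nxt <= cur || std::exp((cur - nxt) / T) > (double)std::rand() / RAND_MAX)
            cur = nxt;
        else
            place[op] = old;                         // reject: undo the move
    }
    return place;
}
```

Note how such a search keeps only the current placement and its cost, which is one way a stochastic approach can maintain the limited, steady memory footprint the abstract refers to.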

Short bio

Satyajit Das is an Assistant Professor in the Department of Data Science and the Department of Computer Science and Engineering at IIT Palakkad, India. He received a joint Ph.D. degree from the University of South Brittany, France, and the University of Bologna, Italy. Prior to joining IIT Palakkad, he was a postdoctoral fellow at LabSTICC, UBS. His research spans systems for AI, and architectures, methods, and tools for low-power systems, including CGRAs, custom processors, multi-cores, high-level synthesis, and compilers. The main focus of Dr. Das's research is highly energy-efficient digital architectures in the domain of heterogeneous and reconfigurable multi-core Systems on Chip (SoCs).

Most Resource Efficient Matrix Vector Multiplication on FPGAs (for Deep Learning Applications)

Speaker: Marc Reichenbach - Brandenburg University of Technology

Abstract: Fast and resource-efficient inference in artificial neural networks (ANNs) is of utmost importance and drives many new developments in the area of new hardware architectures, e.g., by means of systolic arrays or algorithmic optimizations such as pruning. In this talk, we present a novel method for lowering the computation effort of ANN inference utilizing ideas from information theory. Weight matrices are sliced into submatrices of logarithmic aspect ratios. These slices are then factorized. This reduces the number of required computations without compromising on fully parallel processing. We create a new hardware architecture for this dedicated purpose. We also provide a tool to map these sliced and factorized matrices efficiently to FPGAs. Compared with state-of-the-art FPGA implementations, our approach lowers hardware resource usage by a factor of four to six. Our method does not rely on any particular property of the weight matrices of the ANN. It works for the general task of multiplying an input vector with a constant matrix and is also suitable for digital signal processing beyond ANNs.
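
Why logarithmic slice shapes pay off can be illustrated with a classical subset-sum sharing scheme for constant-matrix times vector products, shown below for a {0,1} matrix. This is a related textbook technique chosen for illustration; the talk's exact factorization may differ, and everything in the sketch is an assumption.

```cpp
// y = M * x for a constant {0,1} matrix M (m x n), column-sliced into
// blocks of width w ~ log2(m). Each block precomputes all 2^w subset sums
// of its x entries once; every row then needs one table lookup per block,
// replacing m*w additions per block with roughly m + 2^w additions.
// Uses the GCC/Clang builtin __builtin_ctz.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> mvm(const std::vector<std::vector<int>>& M,
                        const std::vector<double>& x) {
    const int m = (int)M.size(), n = (int)x.size();
    const int w = std::max(1, (int)std::log2((double)m));
    std::vector<double> y(m, 0.0);
    for (int b = 0; b < n; b += w) {
        const int bw = std::min(w, n - b);
        std::vector<double> sums(1u << bw, 0.0);     // all subset sums of x[b..b+bw)
        for (int s = 1; s < (1 << bw); ++s) {
            int lsb = s & -s;                        // extend a previously built sum
            sums[s] = sums[s ^ lsb] + x[b + __builtin_ctz((unsigned)s)];
        }
        for (int r = 0; r < m; ++r) {                // each row: one lookup per block
            int pattern = 0;
            for (int c = 0; c < bw; ++c) pattern |= M[r][b + c] << c;
            y[r] += sums[pattern];
        }
    }
    return y;
}
```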

Mixing analog and digital reconfiguration to achieve low energy and high performance in CNNs

Speaker: Luigi Carro - Instituto de Informática, UFRGS

Abstract: Several works on reconfigurable devices have been shown to increase performance and reduce energy of complex neural networks like CNNs. Another approach has been the use of very low-energy ReRAM devices that can compute matrix multiplication in the analog domain. Such analog circuits have much lower energy consumption, but also severe scalability problems given current fabrication technology. In this talk, we will discuss techniques to merge analog and digital programmable devices in order to achieve both low energy and scalability.
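
The hybrid idea can be sketched as follows: analog crossbars handle small matrix-vector tiles (bounded in size by device non-idealities), while digital logic tiles the full layer and accumulates partial results. The crossbar model below is a functional placeholder, not a device-accurate simulation, and the size bound kXbar is an assumed parameter.

```cpp
// Sketch: tile a large layer into crossbar-sized pieces; accumulate digitally.
#include <algorithm>
#include <vector>

constexpr std::size_t kXbar = 64;  // assumed max crossbar size that stays accurate

// Placeholder for an analog ReRAM crossbar MVM of at most kXbar x kXbar.
std::vector<float> crossbarMvm(const std::vector<std::vector<float>>& W,
                               const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.f);
    for (std::size_t r = 0; r < W.size(); ++r)
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += W[r][c] * x[c];                  // performed in the analog domain
    return y;
}

// Digital side: tiling loop and partial-sum accumulation.
std::vector<float> layerMvm(const std::vector<std::vector<float>>& W,
                            const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.f);
    for (std::size_t r0 = 0; r0 < W.size(); r0 += kXbar)
        for (std::size_t c0 = 0; c0 < x.size(); c0 += kXbar) {
            std::size_t rEnd = std::min(r0 + kXbar, W.size());
            std::size_t cEnd = std::min(c0 + kXbar, x.size());
            std::vector<std::vector<float>> tile;    // carve out one crossbar tile
            for (std::size_t r = r0; r < rEnd; ++r)
                tile.emplace_back(W[r].begin() + c0, W[r].begin() + cEnd);
            std::vector<float> xs(x.begin() + c0, x.begin() + cEnd);
            std::vector<float> part = crossbarMvm(tile, xs);
            for (std::size_t r = r0; r < rEnd; ++r)
                y[r] += part[r - r0];                // digital accumulation
        }
    return y;
}
```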

Developing HPC open source libraries: The OPTIMA experience

Speaker: Dionisis Pnevmatikatos - National Technical University of Athens

Abstract: Reconfigurable technology has been successfully showcased in several computationally intensive applications that exploit the underlying adaptability to extract performance. When applying this technology to more general HPC environments, the necessary tradeoffs and performance tuning are more challenging. I will describe our progress towards a proven set of open-source libraries for typical HPC kernels, starting from basic ones (BLAS L1) and gradually moving towards more involved, interesting, and difficult ones (e.g., BLAS L2 & L3, SpMV, Jacobi, LU decomposition). This work is performed in the context of the OPTIMA EuroHPC project.
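
For a sense of the starting point, a BLAS L1 kernel such as saxpy maps naturally onto an HLS pipeline. The sketch below assumes Vitis-style HLS pragmas and is only indicative of the kernel shape, not taken from the OPTIMA libraries themselves.

```cpp
// Sketch of a BLAS L1 building block (y = alpha*x + y) as an HLS kernel.
// The pragmas request one memory-mapped AXI port per array and a fully
// pipelined loop producing one result per clock once the pipeline fills.
extern "C" void saxpy(const float* x, float* y, float alpha, int n) {
#pragma HLS INTERFACE m_axi port = x bundle = gmem0
#pragma HLS INTERFACE m_axi port = y bundle = gmem1
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II = 1
        y[i] = alpha * x[i] + y[i];
    }
}
```

Kernels such as SpMV or LU decomposition are harder precisely because their irregular access patterns and loop-carried dependences break this simple II=1 pipelining.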

Implementing AI Robotic Algorithms in FPGAs

Speaker: Yiannis Papaefstathiou - ECE School, Aristotle University of Thessaloniki

Abstract: In this talk, I will present two different AI-based applications that are accelerated when executed on FPGA systems: one is a cloud-based robotic simulator, while the other is an edge system for recognizing damage and failures in electricity grids using UAVs. The simulator uses deep learning; in particular, we first evaluate multi-layer perceptron (MLP) inference running on the Jumax CPU and on the Jumax DataFlow Engines (DFEs) from Maxeler using their Dataflow Computing model. For the edge application, we have an optimized version of U-Net executed on a specially designed board for robotics applications utilizing an UltraScale+ FPGA.
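
The MLP workload being offloaded boils down to dense layers of the form below; on a DFE this loop nest becomes a streaming pipeline rather than a sequential program. The sizes and names are illustrative only.

```cpp
// One dense MLP layer with ReLU: the unit of work offloaded to the DFEs.
#include <algorithm>
#include <vector>

std::vector<float> denseRelu(const std::vector<std::vector<float>>& W,
                             const std::vector<float>& bias,
                             const std::vector<float>& in) {
    std::vector<float> out(W.size());
    for (std::size_t o = 0; o < W.size(); ++o) {     // one MAC chain per output neuron
        float acc = bias[o];
        for (std::size_t i = 0; i < in.size(); ++i)
            acc += W[o][i] * in[i];
        out[o] = std::max(0.f, acc);                 // ReLU activation
    }
    return out;
}
```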

Variable precision sparse-dense matrix processing in Tensorflow Lite with dynamic reconfiguration

Speaker: Jose Nunez-Yanez - Linkoping University

Abstract: In this talk, we present a dynamically reconfigurable hardware accelerator called FADES (Fused Architecture for DEnse and Sparse matrices). The FADES design offers multiple configuration options that trade off parallelism and complexity, using a dataflow model to create stages that read, compute, scale, and write results. FADES is mapped to the programmable logic (PL) and integrated with the TensorFlow Lite inference engine running on the processing system (PS) of a heterogeneous SoC device. The accelerator computes the tensor operations, while dynamic reconfiguration is used to switch precision between TFLite int8 and float modes. Compared with supporting both arithmetic precisions simultaneously, this dynamic reconfiguration enables better performance, by allowing more cores to be mapped to the resource-constrained device, and lower power consumption. We compare the proposed hardware with a high-performance systolic architecture for dense matrices, obtaining 25% better performance in dense mode with half the DSP blocks in the same technology. In sparse mode, we show that the core can outperform dense mode even at low sparsity levels, and a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
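
The read/compute/scale/write staging can be pictured in software as below, shown for sparse rows in CSR layout with TFLite-style int8 quantization. In hardware the stages run concurrently as a pipeline; here they appear sequentially, and all structure names are illustrative rather than the FADES interfaces.

```cpp
// Sketch of a staged sparse matrix-vector product: stream nonzeros in (read),
// multiply-accumulate in int32 (compute), requantize (scale), emit (write).
#include <cstdint>
#include <vector>

struct Csr {                        // compressed sparse row matrix
    std::vector<int> rowPtr, col;
    std::vector<int8_t> val;        // TFLite-style int8 weights
};

void spmvInt8(const Csr& A, const std::vector<int8_t>& x,
              float scale, std::vector<float>& y) {
    for (std::size_t r = 0; r + 1 < A.rowPtr.size(); ++r) {
        int32_t acc = 0;
        // read + compute: stream only the nonzeros of row r through the MACs
        for (int k = A.rowPtr[r]; k < A.rowPtr[r + 1]; ++k)
            acc += int32_t(A.val[k]) * int32_t(x[A.col[k]]);
        // scale: map the int32 accumulator back to the output range
        // write: results stream out to memory (here, the output vector)
        y[r] = acc * scale;
    }
}
```

Skipping zeros at the read stage is also why sparse mode can beat dense mode even at low sparsity: the compute stage simply sees fewer operands per row.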

Short bio

Dr Nunez-Yanez is a Professor in energy-efficient and adaptive hardware architectures for machine learning at Linkoping University, Sweden. Prior to that, he was a Reader at the University of Bristol, UK. He holds a PhD in hardware-based parallel data compression from Loughborough University, UK. His main area of expertise is the design of reconfigurable architectures for signal processing, with a focus on run-time adaptation, parallelism, and energy efficiency. In 2006-2007 he was a Marie Curie research fellow at STM, Italy, and in 2011 he was a Royal Society research fellow at ARM Ltd, Cambridge. In 2020-2022 he was an industrial research fellow with the Royal Society at Sensata Technologies.

Speculative Loop Pipelining in High-Level Synthesis

Speaker: Steven Derrien - IRISA/INRIA/Université de Rennes

Abstract: The usage of custom hardware accelerators is shifting toward new application domains such as graph analytics and unstructured text analysis. These applications expose complex control flow that is challenging to map to hardware, especially when operating from a C/C++ description using high-level synthesis (HLS) toolchains. In particular, loop pipelining (LP) is a key optimization in modern HLS tools for synthesizing efficient hardware datapaths. Existing techniques for automatic LP are limited by static analysis that cannot precisely analyze loops with data-dependent control flow and/or memory accesses. We propose a technique for speculative LP that handles both control-flow and memory speculation in a unified manner. Our approach is expressed entirely at the source level, allowing seamless integration into development flows using HLS. Our evaluation shows significant improvement in throughput over standard LP.
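
The source-level flavor of the idea can be conveyed with a small example: a while loop whose exit depends on loaded data cannot be statically pipelined, but a rewritten version can keep the pipeline full by executing iterations speculatively and discarding work past the real exit. This hand-written sketch only shows the control-flow-speculation side; the automatic transformation described in the talk is more general.

```cpp
// Before: the exit test depends on data loaded in the same iteration, so the
// loop-carried dependence prevents an II=1 pipeline in standard HLS.
#include <vector>

int sumUntilNegative(const std::vector<int>& a) {
    int s = 0, i = 0;
    while (i < (int)a.size() && a[i] >= 0) { s += a[i]; ++i; }
    return s;
}

// After: always execute the body, carry a validity predicate, and suppress
// the contribution of iterations that speculatively ran past the exit.
int sumUntilNegativeSpec(const std::vector<int>& a) {
    int s = 0;
    bool valid = true;                        // "still before the real exit"
    for (int i = 0; i < (int)a.size(); ++i) { // now a countable, pipelineable loop
        int v = a[i];                         // speculative load
        bool exitHere = (v < 0);
        s += (valid && !exitHere) ? v : 0;    // commit only valid iterations
        valid = valid && !exitHere;           // squash everything afterwards
    }
    return s;
}
```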

Short bio

Steven Derrien is a professor at the University of Rennes, France, and a researcher at IRISA/INRIA. His research interests include compiler techniques for high-level synthesis and FPGA-based hardware accelerators.