A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms

Cristina Silvano; Daniele Ielmini; Fabrizio Ferrandi; Leandro Fiorin; Serena Curzel; Luca Benini; Francesco Conti; Angelo Garofalo; Cristian Zambelli; Enrico Calore; Sebastiano Fabio Schifano; Maurizio Palesi; Giuseppe Ascia; Davide Patti; Nicola Petra; Davide De Caro; Luciano Lavagno; Teodoro Urso; Valeria Cardellini; Gian Carlo Cardarilli; Robert Birke; Stefania Perri

TL;DR

The survey maps the landscape of hardware accelerators for deep learning across heterogeneous HPC platforms, covering GPUs, TPUs, ASICs, FPGAs, RISC-V-based accelerators, and emerging paradigms such as Processor-In-Memory (PIM) and in-memory computing (IMC), neuromorphic systems, multi-chip modules, and quantum/photonic computing. It provides a structured taxonomy that compares architectures, dataflows, memory hierarchies, and energy-performance trade-offs, and it highlights the role of sparsity and memory technologies in accelerating GEMM and convolution workloads. Key contributions include a comprehensive synthesis of state-of-the-art accelerators, a RISC-V-centric view spanning from edge to HPC, and a discussion of challenges in integration, scaling, and efficiency. The practical value lies in guiding researchers and practitioners as they select, design, and optimize accelerators that meet the diverse compute and energy requirements of modern deep learning workloads across the HPC spectrum.

Abstract

Recent trends in deep learning (DL) have made hardware accelerators essential for various high-performance computing (HPC) applications, including image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent developments in DL accelerators, focusing on their role in meeting the performance demands of HPC applications. We explore cutting-edge approaches to DL acceleration, covering not only GPU- and TPU-based platforms but also specialized hardware such as FPGA- and ASIC-based accelerators, Neural Processing Units, open hardware RISC-V-based accelerators, and co-processors. This survey also describes accelerators leveraging emerging memory technologies and computing paradigms, including 3D-stacked Processor-In-Memory, non-volatile memories like Resistive RAM and Phase Change Memories used for in-memory computing, as well as Neuromorphic Processing Units, and Multi-Chip Module-based accelerators. Furthermore, we provide insights into emerging quantum-based accelerators and photonics. Finally, this survey categorizes the most influential architectures and technologies from recent years, offering readers a comprehensive perspective on the rapidly evolving field of deep learning acceleration.

Paper Structure

This paper contains 37 sections, 14 figures, 11 tables.

Introduction
Deep Learning Background
GPU- and TPU-based accelerators
GPU-based accelerators
TPU-based accelerators
Hardware Accelerators
Reconfigurable Hardware Accelerators
ASIC-based Accelerators
Accelerators for Sparse Matrices
Accelerators based on open-hardware RISC-V
RISC-V ISA extensions for (Deep) Learning
RISC-V Vector Co-processors
RISC-V Memory-coupled Neural Processing Units (NPUs)
Summary
Accelerators based on Emerging Technologies
...and 22 more sections

Figures (14)

Figure 1: Organization of the survey
Figure 2: Overview of state-of-the-art neural network accelerators, based on the available data collected in Guo_Online. Legend: Simulation denotes GOPS/W values collected from post-layout simulation; Test denotes values collected from prototype devices; Product denotes values collected from off-the-shelf devices.
Figure 3: Dataflows in DL accelerators: (a) Weights stationary; (b) Output stationary; (c) Input stationary.
Figure 4: Taxonomy of RISC-V-based acceleration units discussed in the section "Accelerators based on open-hardware RISC-V"
Figure 5: Performance and power consumption of state-of-the-art DL accelerators based on open-hardware RISC-V.
...and 9 more figures
