Sparse Computations in Deep Learning Inference

Ioanna Tasou, Panagiotis Mpakos, Angelos Vlachos, Dionysios Adamopoulos, Georgios Giannakopoulos, Konstantinos Katsikopoulos, Ioannis Karaparisis, Maria Lazou, Spyridon Loukovitis, Areti Mei, Anastasia Poulopoulou, Angeliki Dimitriou, Giorgos Filandrianos, Dimitrios Galanopoulos, Vasileios Karampinis, Ilias Mitsouras, Nikolaos Spanos, Petros Anastasiadis, Ioannis Doudalis, Konstantinos Nikas, George Retsinas, Paraskevi Tzouveli, Christina Giannoula, Nectarios Koziris, Nikela Papadopoulou, Giorgos Stamou, Athanasios Voulodimos, Georgios Goumas

TL;DR

This survey addresses the high cost of inference in deep neural networks, focusing on sparsity as a practical route to reducing compute and energy use. It covers the forms of sparsity (unstructured, semi-structured N:M, and activation/attention sparsity), explains how dense computations map to sparse kernels, and reviews state-of-the-art kernels and hardware support on CPUs and GPUs. It also surveys datasets (notably DLMC) and software tooling, and presents evaluation results for SpMM and SDDMM that shed light on performance trends and integration challenges. The work is a practitioner-oriented resource to guide the design and deployment of highly efficient sparse DNNs in production environments, and it highlights remaining gaps and opportunities for hardware-software co-design.

Abstract

The computational demands of modern Deep Neural Networks (DNNs) are immense and constantly growing. While training costs usually capture public attention, inference demands also contribute significantly to the computational, energy, and environmental footprint. Sparsity stands out as a critical mechanism for drastically reducing these resource demands. However, its potential remains largely untapped and is not yet fully incorporated in production AI systems. To bridge this gap, this work provides the necessary knowledge and insights for performance engineers keen to get involved in deep learning inference optimization. In particular, in this work we: a) discuss the various forms of sparsity that can be utilized in DNN inference, b) explain how the original dense computations translate to sparse kernels, c) provide an extensive bibliographic review of the state of the art in the implementation of these kernels for CPUs and GPUs, d) discuss the availability of sparse datasets in support of sparsity-related research and development, e) explore the current software tools and frameworks that provide robust sparsity support, and f) present evaluation results of different implementations of the key SpMM and SDDMM kernels on CPU and GPU platforms. Ultimately, this paper aims to serve as a resource for performance engineers seeking to develop and deploy highly efficient sparse deep learning models in production.
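
The abstract names SpMM and SDDMM without defining them, so a minimal NumPy sketch may help fix the terms. It assumes the common definitions: SpMM multiplies a sparse matrix by a dense one, and SDDMM computes a dense-dense product only at the nonzero positions of a sparse mask. The CSR layout, the function names `spmm_csr` and `sddmm_csr`, and the toy matrices are illustrative assumptions, not the paper's implementations. Some SDDMM formulations also scale each sampled entry by the mask's stored value, which is omitted here.

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """SpMM sketch: C = A @ B, with sparse A given in CSR form and dense B."""
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for i in range(n_rows):
        # Visit only the stored nonzeros of row i.
        for k in range(indptr[i], indptr[i + 1]):
            C[i] += data[k] * B[indices[k]]
    return C

def sddmm_csr(indptr, indices, X, Y):
    """SDDMM sketch: entries of X @ Y sampled at the mask's nonzero positions."""
    out = np.zeros(len(indices), dtype=X.dtype)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            # One dot product per stored position (i, indices[k]).
            out[k] = X[i] @ Y[:, indices[k]]
    return out

# Hypothetical 3x4 sparse matrix with four nonzeros, written densely for checking.
dense = np.array([[1.0, 0.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0, 0.0],
                  [0.0, 4.0, 0.0, 0.0]])
indptr = np.array([0, 2, 3, 4])     # row i owns entries indptr[i]:indptr[i+1]
indices = np.array([0, 3, 2, 1])    # column index of each stored entry
data = np.array([1.0, 2.0, 3.0, 4.0])

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 5))     # dense right-hand side for SpMM
X = rng.standard_normal((3, 2))     # dense factors for SDDMM
Y = rng.standard_normal((2, 4))

C = spmm_csr(indptr, indices, data, B)
out = sddmm_csr(indptr, indices, X, Y)
```

In both loops the work is proportional to the number of stored nonzeros rather than to the full matrix size, which is the source of the savings that sparse kernels target.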

Paper Structure

This paper contains 102 sections, 25 equations, 28 figures, and 5 tables.

Figures (28)

  • Figure 1: The structure of a basic neural network layer, followed by the activation function ($\phi$).
  • Figure 2: The generic architecture of a DNN.
  • Figure 3: Convolution example: A $3\times 3$ input $I$ is convolved with a $2\times 2$ kernel $K$ using a sliding window to produce the $2\times 2$ output $O$.
  • Figure 4: A basic RNN cell, where the hidden state is updated from the previous hidden state and current input, using shared parameters across timesteps.
  • Figure 5: Structure of an LSTM cell with its gating mechanisms and cell state update.
  • ...and 23 more figures