Accelerating Sparse DNNs Based on Tiled GEMM

Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo

TL;DR

This paper addresses the difficulty of accelerating sparse DNNs on commodity hardware by introducing the tile-wise (TW) sparsity pattern and its hybrid, the tile-vector-wise (TVW) pattern, both of which align pruning with tiled GEMM execution. TW keeps a regular structure inside each tile, so execution stays GEMM-compatible, while allowing pruning to vary irregularly across tiles to preserve accuracy. TVW fuses TW with the sparse tensor core's 2:4 vector-wise pattern to reach a finer granularity. A multi-stage pruning algorithm is paired with an efficient GPU implementation that uses memory coalescing, load balancing through batching, and kernel fusion. The evaluation covers CNNs, NMT, and BERT on NVIDIA A100 GPUs, where TVW achieves average speedups of $1.85\times$, $2.75\times$, and $22.18\times$ over the dense model, block sparsity, and unstructured sparsity, respectively. The results suggest that high-sparsity DNNs can be deployed on existing accelerators without hardware modifications while maintaining accuracy.
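
To make the tile-level idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the weight matrix is split into column tiles of a fixed width, that each column is scored by its L2 norm, and that the lowest-scoring columns are pruned by a single global ranking. The names `tile_wise_prune` and `compact_tile` and the scoring rule are all hypothetical choices made for illustration.

```python
import math


def tile_wise_prune(weights, tile_width, sparsity):
    """Return, for each column tile, the indices of the columns that survive.

    Assumptions (not taken from the paper): columns are scored by L2 norm and
    the lowest-scoring fraction `sparsity` is pruned by one global ranking.
    """
    rows, cols = len(weights), len(weights[0])
    scores = [
        math.sqrt(sum(weights[r][c] ** 2 for r in range(rows)))
        for c in range(cols)
    ]
    num_pruned = int(cols * sparsity)
    pruned = set(sorted(range(cols), key=lambda c: scores[c])[:num_pruned])
    # Each tile keeps a different number of columns: irregular across tiles.
    return [
        [c for c in range(start, min(start + tile_width, cols)) if c not in pruned]
        for start in range(0, cols, tile_width)
    ]


def compact_tile(weights, kept_cols):
    """Gather the surviving columns of one tile into a smaller dense matrix."""
    return [[row[c] for c in kept_cols] for row in weights]


weights = [
    [4.0, 0.1, 0.2, 0.3, 5.0, 6.0],
    [4.0, 0.1, 0.2, 0.3, 5.0, 6.0],
]
tiles = tile_wise_prune(weights, tile_width=2, sparsity=0.5)
dense_tiles = [compact_tile(weights, kept) for kept in tiles]
```

In this toy input the three tiles keep one, zero, and two columns. Each compacted tile is still a small dense matrix that an ordinary GEMM routine could consume, while the amount of pruning differs from tile to tile.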

Abstract

Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often produce randomly-distributed weights to maintain accuracy, leading to irregular computations. Consequently, unstructured sparse models cannot achieve meaningful speedup on commodity hardware built for dense matrix computations. Accelerators are usually modified or designed with structured sparsity-optimized architectures for exploiting sparsity. For example, the Ampere architecture introduces a sparse tensor core, which adopts the 2:4 sparsity pattern. We propose a pruning method that builds upon the insight that matrix multiplication generally breaks the large matrix into multiple smaller tiles for parallel execution. We present the tile-wise sparsity pattern, which maintains a structured sparsity pattern at the tile level for efficient execution but allows for irregular pruning at the global scale to maintain high accuracy. In addition, the tile-wise sparsity is implemented at the global memory level, and the 2:4 sparsity executes at the register level inside the sparse tensor core. We can combine these two patterns into a tile-vector-wise (TVW) sparsity pattern to explore more fine-grained sparsity and further accelerate the sparse DNN models. We evaluate the TVW on the GPU, achieving averages of $1.85\times$, $2.75\times$, and $22.18\times$ speedups over the dense model, block sparsity, and unstructured sparsity.
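
The 2:4 pattern mentioned above can be illustrated with a short sketch. It assumes magnitude-based selection, meaning that the two largest-magnitude values in every group of four consecutive weights are kept and the other two are zeroed. The function name `prune_2_of_4` is hypothetical, and the code only produces the masked values; the compressed storage and index metadata that real sparse tensor core kernels use are not modeled.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four.

    Assumption: magnitude is the importance metric. The row length must be a
    multiple of 4.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


masked = prune_2_of_4([0.1, -0.9, 0.5, 0.2, 1.0, 2.0, -3.0, 0.5])
```

Under the TVW idea described in the abstract, a step like this would be applied to the weights that survive tile-level pruning, so the two granularities compose.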

Paper Structure

This paper contains 45 sections, 14 figures.

Figures (14)

  • Figure 1: The Ampere GPU architecture (A100) introduces the sparse tensor core, which adds 2:4 (2-out-of-4) vector-wise sparsity on top of the dense tensor core.
  • Figure 2: Comparison of six patterns with sparsity.
  • Figure 3: The overview of TW sparsity pattern that exploits the tiled GEMM to maintain the GEMM-compatible execution.
  • Figure 4: The multi-stage pruning algorithm.
  • Figure 5: EW, VW and BW pruning algorithm.
  • ...and 9 more figures