Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators

Paolo D'Alberto; Taehee Jeong; Akshai Jain; Shreyas Manjunath; Mrinal Sarmah; Samuel Hsu; Yaswanth Raparti; Nitesh Pipralia

TL;DR

The paper tackles the high computational cost of large CNNs by introducing weight block sparsity as a hardware-friendly, structured sparsity compatible with block-based accelerators. It develops a full stack: training strategies to obtain and maintain block sparsity, a quantization workflow, and a compiler/code-generation pipeline that maps sparse CNNs onto AIE2 tensor cores with explicit memory plans. Key contributions include a formalization of block sparsity with tunable block granularities, multiple training schemes (including incremental, optimization-based, and predetermined sparsity), and a hardware-aware engine that performs depth-wise tiling to minimize data movement and exploit locality. The results show meaningful speedups on CNNs like ResNet-50 with roughly 50% sparsity and demonstrate the system's potential for hardware-software co-design on AIE2 overlays, with implications for efficient CNNs on specialized AI accelerators and potential applicability to broader DL workloads.

Abstract

Increasingly large Deep Neural Networks (DNNs) are being developed, trained, and deployed. These networks require significant computational resources, which strains both high-end and resource-constrained devices. Our solution is weight block sparsity, a structured sparsity that is friendly to hardware. By zeroing selected blocks of the convolution and fully connected layer parameters of pre-trained DNN models, we can speed up DNN inference: the result is a smaller memory footprint, faster communication, and fewer operations. Our work presents a vertical system that trains convolution and matrix-multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. The compiler recognizes this sparsity and uses it both for data compaction and for splitting the computation into threads. Such blocks take full advantage of both spatial and temporal locality, paving the way for fast vector operations and memory reuse. Applying this system to a ResNet-50 model, we halved the number of weights with minimal accuracy loss, which made inference twice as fast. We present performance estimates based on accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with ResNet-50, Inception V3, and VGG16, to demonstrate the necessary synergy between hardware overlay designs and software stacks for compiling and executing machine learning applications.
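
To make the idea of 8x8 weight block sparsity concrete, the following is a minimal NumPy sketch, not the authors' training or compilation pipeline. It assumes a simple magnitude criterion (zero the tiles with the smallest Frobenius norm), which stands in for the training schemes the paper describes; the names block_mask, block_sparse_matmul, and keep are hypothetical and chosen only for illustration. The second function shows why the structure helps: a multiplication can skip every zeroed tile entirely.

```python
import numpy as np

def block_mask(weights, block=8, sparsity=0.5):
    """Build a mask that zeroes the lowest-norm (block x block) tiles.

    Assumption: tiles are ranked by Frobenius norm; the paper's own
    selection criteria may differ.
    """
    rows, cols = weights.shape
    assert rows % block == 0 and cols % block == 0, "shape must tile evenly"
    # View the matrix as a grid of tiles: (rows/block, block, cols/block, block).
    tiles = weights.reshape(rows // block, block, cols // block, block)
    norms = np.sqrt((tiles ** 2).sum(axis=(1, 3)))  # one norm per tile
    n_zero = int(round(sparsity * norms.size))
    order = np.argsort(norms, axis=None, kind="stable")
    keep = np.ones(norms.size, dtype=bool)
    keep[order[:n_zero]] = False          # drop the smallest-norm tiles
    keep = keep.reshape(norms.shape)
    # Expand the per-tile decision back to element resolution.
    mask = np.repeat(np.repeat(keep, block, axis=0), block, axis=1)
    return mask, keep

def block_sparse_matmul(x, weights, keep, block=8):
    """Compute x @ weights while visiting only the tiles flagged in `keep`."""
    out = np.zeros((x.shape[0], weights.shape[1]),
                   dtype=np.result_type(x, weights))
    for i, j in zip(*np.nonzero(keep)):
        r, c = i * block, j * block
        out[:, c:c + block] += x[:, r:r + block] @ weights[r:r + block, c:c + block]
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))          # 8 x 4 grid of 8x8 tiles
mask, keep = block_mask(W, block=8, sparsity=0.5)
W_sparse = W * mask
x = rng.standard_normal((4, 64))
y = block_sparse_matmul(x, W_sparse, keep)
```

In this sketch half of the 32 tiles are skipped, so the loop performs half of the tile-level multiplications of the dense product while producing the same result as x @ W_sparse.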

Paper Structure (26 sections, 16 equations, 10 figures, 2 tables)

Introduction
Related Works
Block Sparsity in a Matrix
Block-Sparse Matrix-Matrix Multiplication
Block Sparsity: Training and Quantization
Searching Optimum Sparsity ratio
Sparsity Ratio as Incremental
Sparsity Ratio as Trainable as Optimization Problem
Hessian and Fisher Information
Diagonal Hessian
Predetermined Sparsity ratio and Full Training Ahead
Compiler and its Code generation
Hardware Abstraction
Subvolumes, Data Compression, and Data Movements
Schedule and Memory Allocation
...and 11 more sections

Figures (10)

Figure 1: Visualization of a dense and a block-sparse weight matrix; zero blocks appear as uniform green
Figure 2: Example of block sparsity $\Gamma({\bar{\bf W}},8\times 8)$, ${\bar{\bf W}}$, and ${\bf W}$
Figure 3: Block 1x1 and 8x8 performance
Figure 4: 4x4 AIE representation
Figure 5: ResNet single convolution with padding for 4x4. LOAD: activations from DDR to Memtile; LOADW: weights from DDR to Memtile; LOADFM: activations from Memtile to Tensor cores; LOADWM: weights from Memtile to Tensor cores; WRITE: from Memtile to DDR; WRITEFM: from Tensor cores to Memtile; COMP: computation, in this case a convolution.
...and 5 more figures
