Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators
Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia
TL;DR
The paper tackles the high computational cost of large CNNs by introducing weight block sparsity as a hardware-friendly, structured sparsity compatible with block-based accelerators. It develops a full stack: training strategies to obtain and maintain block sparsity, a quantization workflow, and a compiler/code-generation pipeline that maps sparse CNNs onto AIE2 tensor cores with explicit memory plans. Key contributions include a formalization of block sparsity with tunable block granularities, multiple training schemes (including incremental, optimization-based, and predetermined sparsity), and a hardware-aware engine that performs depth-wise tiling to minimize data movement and exploit locality. The results show meaningful speedups on CNNs like ResNet-50 with roughly 50% sparsity and demonstrate the system's potential for hardware-software co-design on AIE2 overlays, with implications for efficient CNNs on specialized AI accelerators and potential applicability to broader DL workloads.
Abstract
Increasingly large Deep Neural Networks (DNNs) are being developed, trained, and deployed. These networks require significant computational resources, straining both high-end and resource-limited devices. Our solution is weight block sparsity, a structured form of sparsity that is friendly to hardware. By zeroing selected blocks of the convolution and fully connected layer parameters of pre-trained DNN models, we can speed up inference efficiently: the result is a smaller memory footprint, faster communication, and fewer operations. We present a vertical system that trains convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it both for data compaction and for splitting the computation into threads. Such blocks take full advantage of both spatial and temporal locality, paving the way for fast vector operations and memory reuse. Applying this system to a ResNet-50 model, we reduced the number of weights by half with minimal accuracy loss, resulting in two-times faster inference. We present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with ResNet-50, Inception V3, and VGG16 to demonstrate the necessary synergy between hardware overlay designs and software stacks for compiling and executing machine learning applications.
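To make the notion of 8x8 block sparsity concrete, the following is a minimal sketch that zeroes half of the 8x8 blocks of a small 2-D weight matrix. It is illustrative only: the function names `block_norms` and `block_prune` are hypothetical, the weights are random, and selecting blocks by smallest sum of squares is an assumption made for this example, not the training scheme of this work.

```python
import random


def block_norms(w, b):
    """Return {(block_row, block_col): sum of squares} for each b x b block of w."""
    rows, cols = len(w), len(w[0])
    assert rows % b == 0 and cols % b == 0, "dimensions must be multiples of b"
    norms = {}
    for bi in range(rows // b):
        for bj in range(cols // b):
            norms[(bi, bj)] = sum(
                w[i][j] ** 2
                for i in range(bi * b, (bi + 1) * b)
                for j in range(bj * b, (bj + 1) * b)
            )
    return norms


def block_prune(w, b=8, sparsity=0.5):
    """Zero the fraction `sparsity` of b x b blocks with the smallest norm.

    Assumption: magnitude-based selection, used here only for illustration.
    Returns the pruned copy of w and the set of zeroed block coordinates.
    """
    norms = block_norms(w, b)
    n_zero = int(len(norms) * sparsity)
    # Sort block coordinates by norm; coordinates break ties deterministically.
    victims = sorted(norms, key=lambda k: (norms[k], k))[:n_zero]
    pruned = [row[:] for row in w]
    for bi, bj in victims:
        for i in range(bi * b, (bi + 1) * b):
            for j in range(bj * b, (bj + 1) * b):
                pruned[i][j] = 0.0
    return pruned, set(victims)


random.seed(0)
# A 16 x 32 matrix holds 2 x 4 = 8 blocks of size 8 x 8.
w = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(16)]
pruned, zeroed = block_prune(w, b=8, sparsity=0.5)
print(len(zeroed), "of", (16 // 8) * (32 // 8), "blocks zeroed")
```

Because entire blocks are zero, a compiler can skip them and store only the surviving blocks densely, which is the property the abstract relies on for data compaction and fewer operations.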
