Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, Hongsheng Li

TL;DR

The paper tackles the challenge of speeding up deep networks without sacrificing accuracy by introducing $N$:$M$ fine-grained structured sparsity trained from scratch. It extends the straight-through estimator with a sparse-refined term (SR-STE) and introduces Sparse Architecture Divergence (SAD) to quantify topology changes during training, demonstrating that SR-STE stabilizes learning and reduces SAD. Across image classification, object detection/segmentation, optical flow, and machine translation, the approach achieves hardware-friendly sparsity (notably 2:4 and 4:8 patterns) with competitive or superior performance compared to dense baselines and prior sparsity methods. The results suggest practical pathways for deploying sparse models on modern GPUs, with broad implications for accelerator-aware neural network design.

Abstract

Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments. It can be generally categorized into unstructured fine-grained sparsity, which zeroes out multiple individual weights distributed across the neural network, and structured coarse-grained sparsity, which prunes blocks of sub-networks of a neural network. Fine-grained sparsity can achieve a high compression ratio but is not hardware friendly and hence yields limited speed gains. On the other hand, coarse-grained sparsity cannot concurrently achieve both apparent acceleration on modern GPUs and decent performance. In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch, which can maintain the advantages of both unstructured fine-grained sparsity and structured coarse-grained sparsity simultaneously on specifically designed GPUs. Specifically, a 2:4 sparse network could achieve 2x speed-up without performance drop on Nvidia A100 GPUs. Furthermore, we propose a novel and effective ingredient, the sparse-refined straight-through estimator (SR-STE), to alleviate the negative influence of the approximated gradients computed by vanilla STE during optimization. We also define a metric, Sparse Architecture Divergence (SAD), to measure the sparse network's topology change during the training process. Finally, we justify SR-STE's advantages with SAD and demonstrate the effectiveness of SR-STE by performing comprehensive experiments on various tasks. Source code and models are available at https://github.com/NM-sparsity/NM-sparsity.
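
To make the N:M pattern from the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "N:M" means keeping the N largest-magnitude weights in every group of M consecutive weights along a row and zeroing the rest. The function name `nm_mask` and the magnitude-based selection rule are illustrative assumptions.

```python
import numpy as np


def nm_mask(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Return a boolean mask that keeps n entries in each group of m.

    Assumption: groups are m consecutive weights along each row, and the
    kept entries are the n with the largest absolute value.
    """
    rows, cols = weights.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    # View every row as (cols // m) groups of m consecutive weights.
    groups = np.abs(weights).reshape(rows, cols // m, m)
    # argsort is ascending, so the last n indices are the largest magnitudes.
    keep = np.argsort(groups, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, cols)


# Hypothetical 1 x 8 weight matrix, i.e. two groups of four weights.
dense = np.array([[0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]])
mask = nm_mask(dense, n=2, m=4)
sparse = dense * mask  # pruned weights become exactly zero
```

In this sketch every group of four ends up with exactly two non-zero entries, which matches the 2:4 layout the abstract associates with Nvidia A100 acceleration.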

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of achieving $N$:$M$ structured sparsity. (Left) In the weight matrix of a $2$:$4$ sparse neural network, whose shape is $R\times C$ (e.g., $R=\textrm{output\_channels}$ and $C=\textrm{input\_channels}$ in a linear layer), at least two entries are zero in each group of 4 consecutive weights. (Middle & Right) The original matrix is compressed, which enables its processing to be further accelerated by designated processing units (e.g., Nvidia A100).
  • Figure 2: In this figure, $\bigodot$ represents element-wise multiplication and $\bigotimes$ indicates matrix multiplication. (a) The forward and backward pass when training an N:M sparse network. In the forward stage, $\widetilde{\mathcal{W}}$ is obtained by pruning $\mathcal{W}$. In the backward stage, the gradient w.r.t. $\widetilde{\mathcal{W}}$ is applied to $\mathcal{W}$ directly. (b) The training process with SR-STE. The forward pass is the same as in (a). However, in the backward pass, the weights of $\mathcal{W}$ are updated not only by $\frac{\partial \mathcal{L}}{\partial \widetilde{\mathcal{W}}}$, but also by $\bar{\mathcal{E}} \odot \mathcal{W}$, where $\bar{\mathcal{E}}$ is the mask matrix for the pruned weights in $\widetilde{\mathcal{W}}$.
  • Figure 3: We compare two networks trained with regular SGD and with STE-modified gradient descent, respectively. (a) Sparse networks trained with STE have a significant drop in top-1 accuracy compared with dense networks. (b) The layer-wise SAD between the weights after a certain number of iterations and the initial weights, for two networks trained with STE (sparse forward) and regular SGD (dense forward). Compared with the network trained with sparse forward gradients, the one trained with dense forward gradients displays a smaller SAD, indicating fewer updates to its sparse network architecture.
  • Figure 4: (a) SAD as a function of training epoch for 4 different settings of $\lambda_W$ in the SR-STE term. When $\lambda_W<0$, the perturbations brought by the coarse gradients of sparse weights are widened, SAD gets higher, and the top-1 accuracy becomes lower. When $\lambda_W$ is set to a reasonable positive value, sparse nets achieve high performance and low SAD. (b) Top-1 accuracy curves of a sparse net trained with STE, a sparse net trained with SR-STE, and a dense net. Sparse networks naively trained with STE have a significant performance drop compared with dense ones. After introducing the SR-STE term into the optimization process, the sparse network's performance rises to a level comparable with dense networks.
  • Figure 5: Illustration of kernel shapes in a ResNet50 model trained with 2:8 structured sparsity. A label such as layer1.1.conv2: (0,32) denotes layer name: (index of input channel, index of output channel).
  • ...and 1 more figure
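
The captions of Figures 2 to 4 describe the SR-STE backward pass and the SAD metric in words. The sketch below is one way to read them, under stated assumptions rather than a reproduction of the authors' code. It assumes the update has the form $\mathcal{W} \leftarrow \mathcal{W} - \gamma\,(\frac{\partial \mathcal{L}}{\partial \widetilde{\mathcal{W}}} + \lambda_W\, \bar{\mathcal{E}} \odot \mathcal{W})$ with learning rate $\gamma$, and that SAD counts the mask positions at which two sparse architectures differ. The names `topn_mask`, `sr_ste_step`, and `sad`, and the magnitude-based pruning rule, are hypothetical.

```python
import numpy as np


def topn_mask(w: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Boolean mask keeping the n largest-magnitude weights per group of m."""
    groups = np.abs(w).reshape(w.shape[0], -1, m)
    keep = np.argsort(groups, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(w.shape)


def sr_ste_step(w, grad_sparse, lr, lambda_w, n=2, m=4):
    """One assumed SR-STE update of the dense weights w.

    grad_sparse is the gradient w.r.t. the pruned weights; plain STE would
    apply it to w unchanged. The extra term decays only the pruned weights.
    """
    pruned = ~topn_mask(w, n, m)  # plays the role of the mask E-bar
    return w - lr * (grad_sparse + lambda_w * pruned * w)


def sad(mask_a: np.ndarray, mask_b: np.ndarray) -> int:
    """Assumed SAD: number of positions where two binary masks differ."""
    return int(np.sum(mask_a != mask_b))


# Hypothetical single group of four weights with a zero task gradient,
# so only the sparse-refined term changes the weights.
w0 = np.array([[1.0, 0.5, -0.2, 0.1]])
w1 = sr_ste_step(w0, np.zeros_like(w0), lr=0.1, lambda_w=0.5)
divergence = sad(topn_mask(w0), topn_mask(w1))
```

With a positive `lambda_w`, the pruned weights shrink toward zero while the kept weights are untouched by the extra term, so the mask is less likely to flip between steps. That is consistent with the lower SAD the Figure 4 caption reports for positive $\lambda_W$.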