Dynamic Sparse Training with Structured Sparsity

Mike Lasby; Anna Golubeva; Utku Evci; Mihai Nica; Yani Ioannou

TL;DR

SRigL introduces a sparse-to-sparse training method that enforces constant fan-in $k$ per neuron to realize fast inference with $N:M$-structured sparsity. It combines dynamic sparse training (RigL-style) with a neuron ablation mechanism ($\gamma_{sal}$) to preserve performance at very high sparsities (up to $99\%$). The method yields real-world acceleration on CPU online inference ($3.4\times$ faster than dense, $2.5\times$ faster than unstructured) and GPU batched inference ($1.7\times$ faster than dense, $13\times$ faster than unstructured) using a 90% sparse layer. Experiments on ResNet-18/CIFAR-10, ResNet-50/ImageNet, and ViT/ImageNet demonstrate competitive accuracy relative to dense baselines and prior sparse methods while delivering practical speedups. This work motivates hardware-aware, fine-grained structured sparsity as a viable path to scalable sparse training and inference.

Abstract

Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically less computationally expensive, achieving speedups with unstructured sparsity on real-world hardware is challenging. In this work, we propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity by imposing a constant fan-in constraint. Using our empirical analysis of existing DST methods at high sparsity, we additionally employ a neuron ablation method which enables SRigL to achieve state-of-the-art sparse-to-sparse structured DST performance on a variety of Neural Network (NN) architectures. Using a 90% sparse linear layer, we demonstrate a real-world acceleration of 3.4x/2.5x on CPU for online inference and 1.7x/13.0x on GPU for inference with a batch size of 256 when compared to equivalent dense/unstructured (CSR) sparse layers, respectively.
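To make the constant fan-in idea concrete, the following is a minimal sketch of how a layer whose neurons all keep exactly $k$ incoming weights could be stored and evaluated in a condensed form. The layout (one table of values and one table of input indices, each of shape neurons × $k$) and all function names are illustrative assumptions, not the authors' implementation; neurons are stored as rows here.

```python
import random

def random_constant_fan_in(n_out, n_in, k, rng):
    """Build a toy weight matrix (one row per neuron) with exactly k non-zeros per row."""
    rows = []
    for _ in range(n_out):
        row = [0.0] * n_in
        for j in rng.sample(range(n_in), k):
            row[j] = rng.gauss(0.0, 1.0)
        rows.append(row)
    return rows

def to_condensed(dense_rows, k):
    """Split a constant fan-in matrix into (neurons x k) tables of values and input indices.
    This layout is an assumption made for illustration."""
    values, indices = [], []
    for row in dense_rows:
        nonzero = [(j, w) for j, w in enumerate(row) if w != 0.0]
        if len(nonzero) != k:
            raise ValueError("row does not have fan-in k")
        indices.append([j for j, _ in nonzero])
        values.append([w for _, w in nonzero])
    return values, indices

def condensed_forward(values, indices, x):
    """y_i = sum_j values[i][j] * x[indices[i][j]]: gather inputs, then multiply-accumulate."""
    return [sum(w * x[j] for w, j in zip(v_row, i_row))
            for v_row, i_row in zip(values, indices)]

def dense_forward(dense_rows, x):
    """Reference dense matrix-vector product."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in dense_rows]

rng = random.Random(0)
n_out, n_in, k = 4, 20, 2  # 2 of 20 weights kept per neuron, i.e. 90% sparse
W = random_constant_fan_in(n_out, n_in, k, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]

values, indices = to_condensed(W, k)
y_condensed = condensed_forward(values, indices, x)
y_dense = dense_forward(W, x)
```

Because every neuron has the same fan-in, both tables are rectangular, so the forward pass touches only `n_out * k` weights with no per-row length bookkeeping. That regularity is what distinguishes this layout from a general unstructured format such as CSR.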

Paper Structure (33 sections, 5 theorems, 18 equations, 21 figures, 9 tables, 1 algorithm)

Introduction
Related work
Dynamic sparse training
Accelerating unstructured sparse neural networks
Learning block structured sparsity from scratch
Learning N:M structured sparsity from scratch
Accelerating fine-grained N:M structured sparsity
Constant fan-in N:M structured sparsity
Online inference
Method
Structured RigL
Results
ResNet-18 trained on CIFAR-10
ResNet-50 trained on ImageNet
Vision Transformer trained on ImageNet
...and 18 more sections

Key Result

Proposition B.2

The variance of each entry $z_{i}$ is [equation missing from source], and therefore the distribution of each $z_{i}$ can be written as [equation missing from source], where $g_{i}$ are $N$ iid $\mathcal{N}(0,1)$ random variables.

Figures (21)

Figure 1: (a) Constant fan-in pruning keeps the most salient weights per neuron, while unstructured pruning keeps the most salient weights per layer. A constant fan-in weight matrix has the same number of non-zero elements (here 2) per column, allowing a condensed representation. While pruning may remove salient weights and affect generalization, with SRigL structure and weights are learned concurrently. (b) Output-norm variance: theoretical predictions and simulation results (see the output-norm variance section) demonstrating that sparse layers with constant fan-in have consistently smaller output-norm variance than layers with the same sparsity but without the constant fan-in constraint.
Figure 2: Neuron ablation. At sparsity levels over 90%, RigL learns to completely mask (ablate) a large number of neurons within each layer, effectively reducing layer width. Imposing a constant fan-in constraint requires all neurons to have the same number of (non-pruned) incoming weights and therefore inhibits ablation, which results in worse generalization performance than RigL. Allowing SRigL to ablate neurons restores RigL-level performance.
Figure 3: (a) ResNet-50/ImageNet top-1 test accuracy when trained with SRigL for a range of sparsities is comparable to RigL. Extended training durations of $\times 2$ and $\times 5$ are also reported for SRigL. Results reported are single runs. (b) Neuron ablation: the percentage of active neurons (i.e., not ablated) following RigL and SRigL training on ResNet-50/ImageNet. A large number of neurons are ablated at high sparsities.
Figure 4: Real-world timings for a fully-connected layer extracted from a model trained with SRigL, comparing the condensed representation learned by SRigL against structured (i.e., the same layer accelerated using only the ablated neurons, without exploiting the fine-grained sparsity) and unstructured (i.e., CSR) representations. The median over a minimum of 5 runs is shown, and the error bars show the standard deviation. Note: the increased timings for the 95% and 99% sparse structured representations are due to relatively fewer neurons being ablated at these sparsities than at 80% and 90%. (a) CPU wall-clock timings for online inference on an Intel Xeon W-2145. For online (single input) inference, our condensed representation at 90% sparsity is 3.4$\times$ faster than dense and 2.5$\times$ faster than unstructured sparsity. See the timing details section. (b) GPU wall-clock timings for inference with a batch size of 256 on an NVIDIA Titan V. At 90% sparsity, our condensed representation is 1.7$\times$ faster than dense and 13.0$\times$ faster than unstructured (CSR) sparse layers. Note that the y-axis is log-scaled.
Figure 5: Test accuracy of Wide ResNet-22 trained on CIFAR-10. Mean and 95% confidence intervals are reported over five runs.
...and 16 more figures
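The contrast drawn in the Figure 1 caption, keeping the most salient weights per neuron versus per layer, can be sketched as follows. This is a minimal illustration that assumes weight magnitude as the saliency score and stores neurons as rows (the caption describes them as columns); the function names are hypothetical and this is one-shot pruning, not the SRigL training procedure.

```python
import random

def prune_per_layer(W, n_keep):
    """Unstructured pruning: keep the n_keep largest-magnitude weights in the whole layer."""
    flat = [(abs(w), i, j) for i, row in enumerate(W) for j, w in enumerate(row)]
    keep = {(i, j) for _, i, j in sorted(flat, reverse=True)[:n_keep]}
    return [[w if (i, j) in keep else 0.0 for j, w in enumerate(row)]
            for i, row in enumerate(W)]

def prune_constant_fan_in(W, k):
    """Constant fan-in pruning: keep the k largest-magnitude weights of every neuron (row)."""
    pruned = []
    for row in W:
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        keep = set(order[:k])
        pruned.append([w if j in keep else 0.0 for j, w in enumerate(row)])
    return pruned

def fan_in(W):
    """Number of non-zero incoming weights for each neuron."""
    return [sum(1 for w in row if w != 0.0) for row in W]

rng = random.Random(0)
n_out, n_in, k = 6, 10, 2
W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

# Both variants keep the same total number of weights (n_out * k).
unstructured = prune_per_layer(W, n_keep=n_out * k)
constant = prune_constant_fan_in(W, k)

print("unstructured fan-in per neuron:", fan_in(unstructured))
print("constant fan-in per neuron:    ", fan_in(constant))
```

Both masks have the same overall sparsity, but only the per-neuron variant guarantees an identical count in every row. Per-layer selection is free to leave some neurons with few or no weights.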

Theorems & Definitions (10)

Definition B.1
Proposition B.2
proof
Corollary B.3
Proposition B.4: "Bernoulli Sparsity"
proof
Proposition B.5: "Constant-per-layer sparsity"
proof
Proposition B.6: "Constant Fan-In sparsity"
proof
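The listing above names Bernoulli and constant fan-in sparsity as separate cases, and the Figure 1 caption states that constant fan-in layers have smaller output-norm variance. A small Monte Carlo sketch can illustrate that comparison. The setup here is an assumption chosen for simplicity rather than the paper's Definition B.1: an all-ones input vector, active weights drawn iid from $\mathcal{N}(0, 1/k)$, and a Bernoulli keep probability of $k/n$ so that both masks have the same expected sparsity.

```python
import random

def sample_output_norm_sq(n_in, n_out, k, mode, rng):
    """Draw one random sparse layer and return ||W u||^2 for the all-ones input u.

    mode = "constant_fan_in": every neuron has exactly k active weights.
    mode = "bernoulli": every weight is kept independently with probability k / n_in.
    """
    p = k / n_in
    std = (1.0 / k) ** 0.5
    total = 0.0
    for _ in range(n_out):
        if mode == "constant_fan_in":
            n_active = k
        else:
            n_active = sum(1 for _ in range(n_in) if rng.random() < p)
        # With u = all-ones, z_i is simply the sum of the neuron's active weights.
        z = sum(rng.gauss(0.0, std) for _ in range(n_active))
        total += z * z
    return total

def variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
n_in, n_out, k, trials = 20, 16, 2, 4000

var_bernoulli = variance(
    [sample_output_norm_sq(n_in, n_out, k, "bernoulli", rng) for _ in range(trials)])
var_constant = variance(
    [sample_output_norm_sq(n_in, n_out, k, "constant_fan_in", rng) for _ in range(trials)])

print("Bernoulli mask, output-norm variance:      ", var_bernoulli)
print("constant fan-in mask, output-norm variance:", var_constant)
```

Under these assumptions the only difference between the two modes is the randomness in the number of active weights per neuron, which the constant fan-in constraint removes. The constant fan-in estimate should therefore come out smaller.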
