Always-Sparse Training by Growing Connections with Guided Stochastic Exploration

Mike Heddes; Narayan Srinivasa; Tony Givargis; Alexandru Nicolau

TL;DR

The paper tackles the computational bottleneck of training large neural networks by proposing an always-sparse dynamic sparse training (DST) method called guided stochastic exploration (GSE). GSE samples a subset of the inactive connections and grows those with the largest gradient magnitudes within that subset, yielding $O(n+N)$ training complexity, where $n$ is the model width and $N \le n^2$ is the number of non-zero weights ($O(n)$ under Erdős–Rényi initialization). Empirically, GSE outperforms existing sparse-training methods on CIFAR-10/100 and ImageNet with ResNet, VGG, and ViT models at high sparsities, and it reduces training FLOPs relative to dense baselines, with larger savings as sparsity increases. The results indicate that larger, sparser CNNs can reach higher accuracy with extended training, which suggests the method scales to very large models under constrained resources.
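The growth step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes uniform sampling of the subset, and the names `grow_connections`, `grad_fn`, and `toy_grad` are hypothetical helpers introduced only for this example.

```python
import random


def grow_connections(active, grad_fn, n_in, n_out, subset_size, k, rng):
    """Sample inactive connections, then grow the k with the largest |gradient|.

    active:  set of (i, j) pairs that are currently active.
    grad_fn: hypothetical callback returning the loss gradient of one (i, j) pair.
    """
    # Never ask for more inactive pairs than actually exist.
    target = min(subset_size, n_in * n_out - len(active))
    subset = set()
    # Rejection sampling (an assumption of this sketch): draw uniform pairs
    # and keep only those that are not already active.
    while len(subset) < target:
        pair = (rng.randrange(n_in), rng.randrange(n_out))
        if pair not in active:
            subset.add(pair)
    # Gradients are evaluated only on the sampled subset, never on all pairs.
    # The pair itself breaks ties so the ranking is deterministic.
    ranked = sorted(subset, key=lambda p: (-abs(grad_fn(*p)), p))
    grown = ranked[:k]
    return active | set(grown), grown


def toy_grad(i, j):
    # Made-up gradient values, for illustration only.
    return (i + 1) * (j - 1)


active = {(0, 0), (1, 1), (2, 0)}
new_active, grown = grow_connections(
    active, toy_grad, n_in=3, n_out=3, subset_size=4, k=1, rng=random.Random(0)
)
```

Because only `subset_size` gradients are evaluated, the cost of the growth step in this sketch depends on the subset size rather than on the number of all possible connections.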

Abstract

The excessive computational requirements of modern artificial neural networks (ANNs) are posing limitations on the machines that can run them. Sparsification of ANNs is often motivated by time, memory and energy savings only during model inference, yielding no benefits during training. A growing body of work is now focusing on providing the benefits of model sparsification also during training. While these methods greatly improve the training efficiency, the training algorithms yielding the most accurate models still materialize the dense weights, or compute dense gradients during training. We propose an efficient, always-sparse training algorithm with excellent scaling to larger and sparser models, supported by its linear time complexity with respect to the model width during training and inference. Moreover, our guided stochastic exploration algorithm improves over the accuracy of previous sparse training methods. We evaluate our method on CIFAR-10/100 and ImageNet using ResNet, VGG, and ViT models, and compare it against a range of sparsification methods.

Paper Structure (36 sections, 1 theorem, 12 equations, 8 figures, 5 tables, 2 algorithms)

This paper contains 36 sections, 1 theorem, 12 equations, 8 figures, 5 tables, 2 algorithms.

Introduction
Related work
After training
During training
Before training
Dynamic sparse training
Guided stochastic exploration
Efficient subset sampling
Dynamic sparse training
Experiments
Effect of subset sample size
Comparison with related work
ImageNet
Model scaling
Extended training
...and 21 more sections

Key Result

Lemma 1

The gradient magnitude of the loss with respect to the parameters of a feedforward layer $l$ has the following upper bound: where $\theta_{l} \in \mathbb{R}^{n_{l-1}\times n_{l}}$, ${\bm{h}}_{l} \in \mathbb{R}^{n_{l}\times B}$, and ${\bm{\delta}}_{l} \in \mathbb{R}^{n_{l} \times B}$ are the weight matrix, activations and gradients of the loss at the output units of the $l$-th layer, respectively.
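The displayed bound is not present in the statement above. As a hedged sketch rather than the paper's exact formula, assume the layer computes $\theta_{l}^{\top}{\bm{h}}_{l-1}$, so that the gradient of the loss $\mathcal{L}$ (a symbol introduced here) with respect to $\theta_{l}$ is ${\bm{h}}_{l-1}{\bm{\delta}}_{l}^{\top}$. For input unit $i$ and output unit $j$, the triangle inequality gives:

$$
\left|\left(\nabla_{\theta_{l}} \mathcal{L}\right)_{ij}\right|
= \left|\sum_{b=1}^{B} h_{l-1,ib}\,\delta_{l,jb}\right|
\le \sum_{b=1}^{B} \left|h_{l-1,ib}\right|\left|\delta_{l,jb}\right|
\le \left(\sum_{b=1}^{B} \left|h_{l-1,ib}\right|\right)\left(\sum_{b=1}^{B} \left|\delta_{l,jb}\right|\right)
$$

The last step holds because expanding the product of the two non-negative sums yields every term of the middle sum plus non-negative cross terms. Under these assumptions the right-hand side is an entry of the outer product of two vectors, one of length $n_{l-1}$ and one of length $n_{l}$, so any single entry of the bound can be evaluated without forming the dense $n_{l-1}\times n_{l}$ gradient.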

Figures (8)

Figure 1: Illustration of the connections before (left) and after (right) a prune and grow step in dynamic sparse training on a 3-by-3 feedforward layer. The sparse model parameters are represented by the active connections $A$ and their weights $\theta$. Connection $(2,2)$ is pruned because its weight has the lowest magnitude. The connection $(3, 2)$ is grown and its weight will hereafter be optimized with stochastic gradient descent.
Figure 2: Illustration of the relations between the connection sets. The solid area becomes the next active set. The set $W$ is dotted to indicate that it is never materialized.
Figure 3: Accuracy of each distribution while increasing the number of subset samples (bounding the size of the subset) compared against RigL at 98% sparsity.
Figure 4: Improvement in training FLOPs by GSE compared to RigL for ResNet-50 on ImageNet.
Figure 5: Accuracy comparison of uniform and Erdős–Rényi sparsity assignment.
...and 3 more figures

Theorems & Definitions (2)

Lemma 1
proof
