
Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning

Andy Li, Aiden Durrant, Milan Markovic, Tianjin Huang, Souvik Kundu, Tianlong Chen, Lu Yin, Georgios Leontidis

TL;DR

The paper addresses the challenge of training neural networks at extreme sparsities, where accuracy typically collapses due to gradient flow issues. It introduces Extreme Adaptive Sparse Training (EAST), a modular framework that combines three components: Dynamic ReLU (DyReLU) phasing, weight sharing within residual blocks, and cyclic sparsity scheduling. Together, these maintain learning dynamics and exploration under extreme sparsity. The authors demonstrate that EAST achieves competitive or superior accuracy on ResNet-34/50 across CIFAR-10/100 and ImageNet at sparsities up to 99.99%, often outperforming prior DST and SST methods, and shows favorable inference characteristics without dense offset computations. The work also provides complexity analyses and ablations validating the contribution of each component, highlighting EAST as a practical strategy for deploying highly compressed models on resource-constrained devices.
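
To make "extreme sparsity" concrete, the following minimal sketch counts how many weights survive at the sparsity levels reported in the abstract. The total of 20,000,000 weights is a hypothetical round number chosen for illustration, not a figure from the paper, and `remaining_weights` is a helper name introduced here.

```python
from fractions import Fraction


def remaining_weights(total_weights: int, sparsity_percent: str) -> int:
    """Return the number of nonzero weights left at a given sparsity.

    The sparsity is passed as a decimal string (e.g. "99.99") so the
    density is computed exactly, without floating-point rounding.
    """
    density = 1 - Fraction(sparsity_percent) / 100
    return round(total_weights * density)


# Hypothetical parameter count, used only for illustration.
TOTAL_WEIGHTS = 20_000_000

for level in ["95", "98", "99.90", "99.95", "99.99"]:
    kept = remaining_weights(TOTAL_WEIGHTS, level)
    print(f"{level}% sparsity: {kept:,} weights kept")
```

Under this assumption, moving from 98% to 99.99% sparsity shrinks the budget from 400,000 weights to 2,000, a 200-fold reduction, which gives a sense of why training becomes fragile in this regime.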

Abstract

Pruning of deep neural networks has been an effective technique for reducing model size while preserving most of the performance of dense networks, crucial for deploying models on memory- and power-constrained devices. While recent sparse learning methods have shown promising performance up to moderate sparsity levels such as 95% and 98%, accuracy quickly deteriorates when pushing sparsities to extreme levels due to unique challenges such as fragile gradient flow. In this work, we explore network performance beyond the commonly studied sparsities, and develop techniques that encourage stable training without accuracy collapse even at extreme sparsities, including 99.90%, 99.95% and 99.99% on ResNet architectures. We propose three complementary techniques that enhance sparse training through different mechanisms: 1) Dynamic ReLU phasing, where DyReLU initially allows for richer parameter exploration before being gradually replaced by standard ReLU, 2) weight sharing, which reuses parameters within a residual layer while maintaining the same number of learnable parameters, and 3) cyclic sparsity, where both sparsity levels and sparsity patterns evolve dynamically throughout training to better encourage parameter exploration. We evaluate our method, which we term Extreme Adaptive Sparse Training (EAST), at extreme sparsities using ResNet-34 and ResNet-50 on CIFAR-10, CIFAR-100, and ImageNet, achieving competitive or improved performance compared to existing methods, with notable gains at extreme sparsity levels.
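
The abstract does not say how DyReLU is "gradually replaced" by ReLU. As one plausible reading, the sketch below treats an activation as a maximum over linear pieces $a_k x + b_k$ and linearly interpolates the piece coefficients toward those of ReLU, which is the special case with pieces $(1, 0)$ and $(0, 0)$. The interpolation rule, the fixed coefficients in `DY_PIECES`, and the function names are all assumptions made for illustration; in an actual DyReLU the coefficients are computed from the input rather than fixed.

```python
from typing import List, Sequence, Tuple

Piece = Tuple[float, float]  # (slope a_k, intercept b_k)

# ReLU written as a max over two linear pieces: max(1*x + 0, 0*x + 0).
RELU_PIECES: Tuple[Piece, Piece] = ((1.0, 0.0), (0.0, 0.0))


def piecewise_max(x: float, pieces: Sequence[Piece]) -> float:
    """Evaluate max_k (a_k * x + b_k)."""
    return max(a * x + b for a, b in pieces)


def phased_pieces(dy_pieces: Sequence[Piece], progress: float) -> List[Piece]:
    """Blend the given pieces toward ReLU (assumed linear schedule).

    progress = 0 returns the original pieces; progress = 1 returns ReLU.
    """
    t = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return [
        ((1 - t) * a + t * relu_a, (1 - t) * b + t * relu_b)
        for (a, b), (relu_a, relu_b) in zip(dy_pieces, RELU_PIECES)
    ]


# Illustrative fixed coefficients; the negative branch keeps a slope of 0.25.
DY_PIECES: Tuple[Piece, Piece] = ((1.0, 0.0), (0.25, 0.0))

for progress in (0.0, 0.5, 1.0):
    y = piecewise_max(-2.0, phased_pieces(DY_PIECES, progress))
    print(f"progress={progress}: f(-2.0) = {y}")
```

In this toy setting, negative inputs pass through with a nonzero slope early on and are suppressed more and more as `progress` approaches 1, at which point the function coincides with standard ReLU.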

Paper Structure

This paper contains 11 sections, 10 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of EAST (top). Starting with an ERK-initialized sparse network $\mathcal{A} = \{\theta_0, \mathcal{M}_0\}$ at sparsity $s$, EAST employs three key components: DyReLU phasing, weight sharing, and cyclic sparsity to transform the network to the final state $\mathcal{A}' = \{\theta_T, \mathcal{M}_T\}$, achieving meaningful performance at extreme sparsity levels. The topology update box (bottom) illustrates the connectivity change throughout training: connections are first grown, then pruned, and eventually maintained with a fixed sparsity update schedule until completion.
  • Figure 2: Comparison of test accuracies across different sparsities. Each point represents median accuracy over three runs with different seeds, and the shaded regions highlight the variability across runs.
  • Figure 3: Positive pre-activations analysis in ResNet-34 at 99.99% sparsity. The left figure shows a layerwise comparison of positive pre-activations after DyReLU is completely converted to ReLU. The right figure shows their overall amount before, during, and after DyReLU phasing.
  • Figure 4: A layer in ResNet-34 with 4 blocks. Block 3 and block 4 share parameters (and masks) with block 2. Specifically, the conv1 layer (green) of block 3 reuses the conv1 layer of block 2, multiplied by a learnable scaling factor; similarly, its conv2 layer (blue) reuses the conv2 layer of block 2, multiplied by another scaling factor.
  • Figure 5: Gradient flow analysis. The top row compares gradients with and without DyReLU. The bottom row compares gradients with and without weight sharing.
  • ...and 1 more figure
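
The weight sharing described in the Figure 4 caption can be sketched as follows: one block owns the weights and the mask, and later blocks reuse them through a per-block scaling factor. This is a minimal illustration with a toy 2x3 matrix standing in for a convolution kernel. The class and function names, the scale values, and the choice to give block 2 a scale of 1.0 are assumptions made here, not details from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]


@dataclass
class SharedConv:
    """Weights and binary mask owned by the source block (block 2)."""
    weight: Matrix
    mask: Matrix


def effective_weight(shared: SharedConv, scale: float) -> Matrix:
    """Masked shared weights multiplied by one block's scaling factor."""
    return [
        [scale * w * m for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(shared.weight, shared.mask)
    ]


# Toy stand-in for block 2's conv1 kernel; real kernels are 4-D tensors.
conv1_block2 = SharedConv(
    weight=[[0.5, -1.0, 2.0], [0.0, 3.0, -0.5]],
    mask=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
)

# Illustrative scaling factors; in training these would be learnable.
scales: Dict[str, float] = {"block2": 1.0, "block3": 0.5, "block4": 2.0}

weights = {
    name: effective_weight(conv1_block2, scale)
    for name, scale in scales.items()
}

for name, matrix in weights.items():
    print(name, matrix)
```

Because the mask is shared along with the weights, a connection pruned in block 2 is also absent in blocks 3 and 4; each reusing block adds only its scalar factor on top of the shared weights.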