Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

Simon Dufort-Labbé, Pierluca D'Oro, Evgenii Nikishin, Razvan Pascanu, Pierre-Luc Bacon, Aristide Baratin

TL;DR

This work reframes dying neurons as a resource for pruning by introducing Demon Pruning (DemP), a dense-to-sparse training method that actively promotes neuron saturation through scheduled regularization of normalization scale parameters and asymmetric noise added to live weights. DemP prunes dead neurons on-the-fly during training, yielding highly structured sparsity with minimal performance loss and substantial training speedups on CIFAR-10, ImageNet, and transformer-like models. The method demonstrates superior accuracy-sparsity tradeoffs compared to strong dense-to-sparse baselines, especially with Adam, and is compatible with existing pruning techniques, offering a practical approach to efficient model compression. Theoretical and empirical analysis links neuron death to SGD noise and hyperparameters, and ablations validate design choices, while broader impacts address energy efficiency and responsible AI considerations.

Abstract

When training neural networks, dying neurons -- units becoming inactive or saturated -- are traditionally seen as harmful. This paper sheds new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycle schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56$\times$. These findings provide a novel perspective on dying neurons as a resource for efficient model compression and optimization.
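
The mechanism described in the abstract (noise on active units, then removal of units that have died) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single ReLU layer, treats a neuron as dead when its activation is zero for every input in a batch, and uses hypothetical helper names (`live_mask`, `add_noise_to_live`, `prune_dead`) chosen for illustration only.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def forward(weights, biases, batch):
    """Return ReLU activations: one row per input, one column per neuron."""
    return [
        [relu(sum(w * x for w, x in zip(row, inp)) + b)
         for row, b in zip(weights, biases)]
        for inp in batch
    ]

def live_mask(activations):
    """Assumption: a neuron is live if it fires for at least one input."""
    n_neurons = len(activations[0])
    return [any(act[j] > 0.0 for act in activations) for j in range(n_neurons)]

def add_noise_to_live(weights, mask, sigma, rng):
    """Add N(0, sigma^2) noise to the weights of live neurons only."""
    return [
        [w + rng.gauss(0.0, sigma) for w in row] if alive else list(row)
        for row, alive in zip(weights, mask)
    ]

def prune_dead(weights, biases, mask):
    """Drop whole rows (and biases) of dead neurons: structured pruning."""
    kept = [(row, b) for row, b, alive in zip(weights, biases, mask) if alive]
    return [row for row, _ in kept], [b for _, b in kept]

rng = random.Random(0)
batch = [[1.0, 2.0], [0.5, 0.1], [2.0, 1.0]]       # toy non-negative inputs
weights = [[0.3, -0.2], [-1.0, -1.0], [0.5, 0.5]]  # one row per neuron
biases = [0.1, -0.5, 0.0]

mask = live_mask(forward(weights, biases, batch))
noisy = add_noise_to_live(weights, mask, sigma=0.01, rng=rng)
pruned_w, pruned_b = prune_dead(noisy, biases, mask)
```

In this toy layer the second neuron has negative pre-activations on every input, so it receives no noise and is removed as a whole row. Removing whole rows is what makes the resulting sparsity structured.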

Paper Structure

This paper contains 43 sections, 3 theorems, 18 equations, 17 figures, 3 tables, 1 algorithm.

Key Result

Proposition B.1

Consider the system (eq:SGD_SDE_simple) initialized at $w_0 >0$. The survival probability at time $t>0$ is given by

Figures (17)

  • Figure 1: Dead neuron accumulation for a ResNet-18 trained on CIFAR-10 with different activation functions and values of the learning rate. We use a negative slope of $\alpha=0.05$ for Leaky ReLU and $\beta=1$ for Swish.
  • Figure 2: Left: Stronger regularization during training increases the ratio of dead units, shown here for a ResNet-18 trained on CIFAR-10. We use $\cdot(\gamma)$ to denote regularization applied solely to the scale parameters of the normalization layers. Right: Augmenting training updates with asymmetric Gaussian noise, sampled from $\mathcal{N}(x | 0, \sigma^2)$ and applied to the weights of live neurons only, also increases dead unit accumulation. Results are for a ResNet-18 trained on CIFAR-10 with various values of the Lasso($\gamma$) regularization parameter.
  • Figure 3: Top: For ResNet-18 networks on CIFAR-10 trained with Adam (ReLU), DemP can find sparser solutions while maintaining better performance than other structured approaches. Higher levels of sparsity with DemP are obtained by increasing the peak strength of the added scheduled regularization. Bottom: With SGDM, DemP performance is comparable, without significant differences between methods. Left: Neural sparsity, structured methods. Right: Weight sparsity, structured methods.
  • Figure 4: Impact of different schedules over the regularization parameter for DemP, concluding that a one-cycle scheduler is a good default choice. Experiments were performed with ResNet-18 on CIFAR-10, with Lasso($\gamma$) regularization, across three seeds. Higher sparsities are obtained by increasing the peak strength of the added scheduled regularization. Left: With Adam optimizer. Right: With SGDM optimizer.
  • Figure 5: ResNet-18 networks trained on CIFAR-10 with different added regularization strategies, over three seeds. $\cdot(\gamma)$ denotes when regularization is only applied on the scale parameters of the normalization layer. Higher sparsities are obtained by increasing the peak strength of the added scheduled regularization. Left: With Adam, using L2($\gamma$) regularization slightly outperforms other strategies. Right: Using SGDM, the differences in performance become more pronounced, with Lasso regularization applied to scale parameters providing a favorable balance between sparsity and performance.
  • ...and 12 more figures
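
The captions for Figures 4 and 5 refer to a one-cycle schedule over the regularization strength, with Lasso($\gamma$) applied only to the normalization scale parameters. The exact shape of the schedule is not given in this summary, so the sketch below assumes a simple triangular ramp (zero, up to a peak at mid-training, back to zero). The function names `one_cycle` and `lasso_penalty` are hypothetical.

```python
def one_cycle(step, total_steps, peak):
    """Assumed shape: linear ramp from 0 to `peak` at mid-training, then back to 0."""
    half = total_steps / 2.0
    if step <= half:
        return peak * step / half
    return peak * (total_steps - step) / half

def lasso_penalty(gammas, strength):
    """L1 penalty on normalization scale parameters only, not on other weights."""
    return strength * sum(abs(g) for g in gammas)

total_steps = 100
schedule = [one_cycle(s, total_steps, peak=1e-3) for s in range(total_steps + 1)]

# Penalty term added to the loss at the peak of the cycle, for three toy scales.
penalty_at_peak = lasso_penalty([0.5, -0.25, 1.0], schedule[50])
```

Under this assumption, raising `peak` is the single knob that trades accuracy for sparsity, which matches the captions' statement that higher sparsities come from increasing the peak strength of the scheduled regularization.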

Theorems & Definitions (6)

  • Proposition B.1
  • proof
  • Lemma B.2
  • proof
  • Lemma B.3
  • proof