
No Free Prune: Information-Theoretic Barriers to Pruning at Initialization

Tanishq Kumar, Kevin Luo, Mark Sellke

TL;DR

This work provides an information-theoretic barrier to pruning neural networks at initialization by introducing an effective parameter count $p_{\text{eff}} = I(\mathbf{m}^W; \mathcal{D}) + \mathbb{E}[\|\mathbf{m}\|_1]$, which adds the data-dependent information carried by the sparsity mask to the traditional count of non-zero parameters. The authors extend the Law of Robustness to this $p_{\text{eff}}$, showing that robust interpolation of noisy data at high sparsity requires a Lipschitz constant that scales with $\sqrt{nd / p_{\text{eff}}}$, which counts against data-agnostic pruning at initialization. They argue that pruning after training (e.g., iterative magnitude pruning) increases the mask's mutual information with the data, raising $p_{\text{eff}}$ and yielding higher capacity than data-agnostic pruning at initialization; this reconciles the existence of lottery tickets with the difficulty of finding them quickly. Experiments on memorization capacity and noise correlation corroborate that information gained during training affects model capacity and that data-dependent pruning methods yield higher effective capacity than initialization-based ones. Overall, the paper provides a principled framework for understanding why pruning at initialization struggles and why lottery tickets arise only when masks are learned from data through training.

Abstract

The existence of "lottery tickets" (arXiv:1803.03635) at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model ("pruning at initialization") have been broadly unsuccessful (arXiv:2009.08576). We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show the Law of Robustness of arXiv:2105.12806 extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsity, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist, but cannot be found fast (i.e., without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.
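To make the definition of $p_\text{eff}$ concrete, the following is a minimal sketch that evaluates "expected non-zero weights plus mutual information between mask and data" on a toy discrete example. Everything in it is an assumption for illustration: the function names, the representation of the joint distribution as a dictionary, the choice of bits as the unit of information, and the omission of any dependence on the initial weights $W$. It is not the authors' code.

```python
import math
from collections import defaultdict


def mutual_information_bits(joint):
    """I(mask; dataset) in bits for a joint pmf {(mask, dataset): prob}."""
    p_mask = defaultdict(float)
    p_data = defaultdict(float)
    for (mask, data), p in joint.items():
        p_mask[mask] += p
        p_data[data] += p
    mi = 0.0
    for (mask, data), p in joint.items():
        if p > 0:
            mi += p * math.log2(p / (p_mask[mask] * p_data[data]))
    return mi


def expected_nonzeros(joint):
    """E[||m||_1]: expected number of unpruned weights under the joint pmf."""
    return sum(p * sum(mask) for (mask, _), p in joint.items())


def effective_params(joint):
    """Toy p_eff = I(mask; dataset) + E[||m||_1] (units: bits, an assumption)."""
    return mutual_information_bits(joint) + expected_nonzeros(joint)


# Two equally likely datasets "A" and "B"; masks keep 2 of 4 weights.
# Data-agnostic pruning: the same mask regardless of the dataset.
agnostic = {((1, 1, 0, 0), "A"): 0.5, ((1, 1, 0, 0), "B"): 0.5}
# Data-dependent pruning: the mask reveals which dataset was seen.
dependent = {((1, 1, 0, 0), "A"): 0.5, ((0, 0, 1, 1), "B"): 0.5}

print(effective_params(agnostic))   # same sparsity, no information in the mask
print(effective_params(dependent))  # same sparsity, one extra bit from the mask
```

Both toy networks keep two weights, yet the data-dependent mask has a larger effective parameter count, which mirrors the abstract's point that equal sparsity does not imply equal $p_\text{eff}$.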

Paper Structure

This paper contains 33 sections, 12 theorems, 50 equations, 3 figures.

Key Result

Theorem 4.1

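Let $\mathcal{F}$ be a class of functions from $\mathbb{R}^d \rightarrow \mathbb{R}$ and let $\left(x_i, y_i\right)_{i \in[n]}$ be i.i.d. input-output pairs in $\mathbb{R}^d \times[-1,1]$. Assume that: (1) $\mathcal{F}$ admits a Lipschitz parametrization by $p$ real parameters, each of size at most $\mathrm{poly}(n, d)$; (2) the distribution $\mu$ of the covariates $x_i$ satisfies isoperimetry (or is a mixture thereof); (3) the expected conditional variance of the output is strictly positive, denoted $\sigma^2 \equiv \mathbb{E}^{\mu}[\operatorname{Var}[y \mid x]] > 0$. Then, with high probability over the sampling of the data, one has simultaneously for all $f \in \mathcal{F}$: $$\frac{1}{n} \sum_{i=1}^{n}\left(f(x_i)-y_i\right)^2 \leq \sigma^2-\epsilon \;\Longrightarrow\; \operatorname{Lip}(f) \geq \widetilde{\Omega}\left(\epsilon \sqrt{\frac{nd}{p}}\right).$$ Here $\operatorname{Lip}(f)$ denotes the Lipschitz constant of $f$.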
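
As a rough illustration of the $\sqrt{nd / p_{\text{eff}}}$ scaling described in the TL;DR, consider the following worked arithmetic. All numbers are hypothetical, and constants and logarithmic factors are ignored; the only purpose is to show how the Lipschitz scale moves when the mask carries information about the data. Take $n = 10^4$ samples in dimension $d = 10^2$, so $nd = 10^6$, and a mask that keeps $\mathbb{E}[\|\mathbf{m}\|_1] = 10^4$ weights out of $10^6$:

$$
\begin{aligned}
\text{dense network:} \quad & p = 10^6, & \sqrt{nd/p} &= \sqrt{10^6 / 10^6} = 1, \\
\text{data-agnostic mask:} \quad & p_{\text{eff}} = 0 + 10^4 = 10^4, & \sqrt{nd/p_{\text{eff}}} &= \sqrt{10^2} = 10, \\
\text{data-dependent mask:} \quad & p_{\text{eff}} = 4 \times 10^4 + 10^4 = 5 \times 10^4, & \sqrt{nd/p_{\text{eff}}} &= \sqrt{20} \approx 4.47.
\end{aligned}
$$

Under these assumed values, two networks with identical sparsity face different Lipschitz scales, and the one whose mask was learned from data is closer to the dense network.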

Figures (3)

  • Figure 1: Top: memorization capacity (train accuracy on noisy data) against sparsity level for different pruning methods; higher is better. The vertical gap between IMP/Magnitude-after pruning reflects additional memorization capacity on this dataset due to mutual information between mask and data. Panels: 2-hidden-layer MLP on Gaussian data and on noisy FashionMNIST, and a 4-layer ConvNet on noisy CIFAR-10. Bottom: ability to correlate with dataset noise over training, as a proxy for network capacity, for a 5-layer ReLU MLP in a student-teacher task with $\sigma^2 = 1$ noisy labels. Left: correlation with noise during IMP; training increases correlation with noise (hence $p_{\text{eff}}$), pruning then reduces it, and the cycle repeats. Middle: sweep over learning rates. Right: sweep over the amount of noise injected into gradients. This illustrates that the quantity of interest, $I(f^W;\mathcal{D})$, and thus $p_{\text{eff}}$, increases due to data contained in gradients.
  • Figure 2: Correlation with data noise over training epochs, IMP. Expanded version of the IMP noise-correlation panel of Figure 1, pruned to almost complete sparsity.
  • Figure 3: Exact mutual information on a small toy model. Left: regression; right: classification.
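
The noise-correlation proxy in the figures above can be sketched as follows. The exact statistic used in the paper is not given in this summary, so this is one plausible reading, stated as an assumption: the Pearson correlation between the label noise and the part of the network's prediction that deviates from the clean (teacher) target. The function names and the toy numbers are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return cov / (norm_u * norm_v)


def noise_correlation(preds, clean_targets, noisy_targets):
    """Correlation between label noise and the prediction's deviation from the clean target.

    Assumed proxy: a network that has memorized the noise scores near 1,
    one that outputs the clean teacher signal scores 0.
    """
    noise = [y - t for y, t in zip(noisy_targets, clean_targets)]
    deviation = [p - t for p, t in zip(preds, clean_targets)]
    return pearson(deviation, noise)


# Hypothetical student-teacher data: clean teacher outputs plus label noise.
clean = [0.0, 1.0, 2.0, 3.0]
noise = [0.5, -0.5, 0.25, -0.25]
noisy = [t + e for t, e in zip(clean, noise)]

print(noise_correlation(noisy, clean, noisy))  # student that interpolates the noisy labels
print(noise_correlation(clean, clean, noisy))  # student that recovers the teacher exactly
```

Tracking this quantity across epochs and pruning rounds would give a curve of the kind the captions describe, rising while the network fits noise and dropping after each pruning step.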

Theorems & Definitions (18)

  • Theorem 4.1: Theorem 1, bubeck2021universal, informal
  • Theorem 4.2: Informal, Modified Law of Robustness
  • Definition 5.1
  • Lemma 5.2: xu2017information
  • Theorem 5.3
  • Lemma 5.4
  • Lemma 5.5
  • Theorem 5.6
  • Lemma 1.1
  • proof
  • ...and 8 more