Pruning at Initialization -- A Sketching Perspective

Noga Bar; Raja Giryes

TL;DR

This paper reframes pruning at initialization as a sketching problem for efficient matrix-vector multiplications, showing that finding a sparse mask corresponds to sampling coordinates with an optimal probability. It derives explicit bounds on the approximation error when applying the initialization mask to the end-of-training vector and demonstrates that data-free pruning can explain the success of lottery tickets, supported by a sketching-based analysis. The authors connect pruning methods like SynFlow and SNIP to sketching, propose randomized mask strategies to improve data-free pruning, and extend the perspective to Neural Tangent Kernel pruning with theoretical bounds. Empirically, randomized, data-free pruning performs competitively across multiple architectures and datasets, suggesting practical benefits and robustness when data are unavailable. Overall, the sketching lens provides theoretical grounding for data-independence in sparse subnetworks and offers concrete algorithmic improvements for pruning without data.
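The randomized mask idea mentioned above can be illustrated with a small sketch. It is not the authors' procedure: the summary does not spell out their exact strategy, so the code below shows only one plausible reading, in which a mask is drawn with probability proportional to per-weight saliency scores instead of keeping the top-k scores deterministically. The names `topk_mask` and `randomized_mask` and the synthetic `scores` are hypothetical; in SNIP or SynFlow the scores would come from the method's own saliency computation, which is not reproduced here.

```python
import random


def topk_mask(scores, k):
    """Deterministic mask: keep the k highest-scoring weights."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    for i in order[:k]:
        mask[i] = 1
    return mask


def randomized_mask(scores, k, rng):
    """Randomized mask: draw k distinct indices, each draw proportional to score.

    Assumes all scores are strictly positive and k <= len(scores).
    """
    remaining = list(range(len(scores)))
    mask = [0] * len(scores)
    for _ in range(k):
        weights = [scores[i] for i in remaining]
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        mask[remaining.pop(pick)] = 1
    return mask


# Synthetic, strictly positive scores standing in for real saliency values.
rng = random.Random(0)
scores = [abs(rng.gauss(0.0, 1.0)) + 1e-12 for _ in range(50)]
k = 10

topk = topk_mask(scores, k)
rand = randomized_mask(scores, k, rng)
overlap = sum(a & b for a, b in zip(topk, rand))
print("kept by both masks:", overlap, "of", k)
```

Both masks keep exactly `k` weights. The randomized one still favors high scores but can also retain lower-scoring weights, which is the kind of behavior Figure 2 contrasts for SynFlow with and without randomization.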

Abstract

The lottery ticket hypothesis (LTH) has increased attention to pruning neural networks at initialization. We study this problem in the linear setting. We show that finding a sparse mask at initialization is equivalent to the sketching problem introduced for efficient matrix multiplication. This gives us tools to analyze the LTH problem and gain insights into it. Specifically, using the mask found at initialization, we bound the approximation error of the pruned linear model at the end of training. We theoretically justify previous empirical evidence that the search for sparse networks may be data independent. By using the sketching perspective, we suggest a generic improvement to existing algorithms for pruning at initialization, which we show to be beneficial in the data-independent case.
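To make the link between masks and sketching concrete, here is a minimal sketch of randomized coordinate sampling for a matrix-vector product, in the spirit of the sampling scheme for approximate matrix multiplication that the paper cites. It assumes that $X$ is stored as $d$ rows $x_i$, that coordinates are drawn with replacement with probability $p_i \propto |w_i|\,\|x_i\|_2$, and that each draw is rescaled by $1/(s\,p_i)$ so that the masked product is an unbiased estimate of $X^\top w$. The function names and the synthetic Gaussian data are illustrative assumptions, not taken from the paper.

```python
import random


def sketch_mask(X, w, s, rng):
    """Sample s coordinates with replacement and return a rescaled mask m and p.

    X is a list of d rows (each a list of n floats); w is a list of d floats.
    p_i is proportional to |w_i| * ||x_i||_2 (an assumed choice of probabilities).
    """
    d = len(w)
    weights = [abs(w[i]) * sum(v * v for v in X[i]) ** 0.5 for i in range(d)]
    total = sum(weights)
    p = [wt / total for wt in weights]
    m = [0.0] * d
    for i in rng.choices(range(d), weights=p, k=s):
        m[i] += 1.0 / (s * p[i])  # rescaling keeps the estimate unbiased
    return m, p


def matvec(X, w):
    """Compute X^T w for X stored as d rows of length n."""
    n = len(X[0])
    return [sum(w[i] * X[i][j] for i in range(len(w))) for j in range(n)]


rng = random.Random(0)
d, n, s = 200, 5, 40
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
w = [rng.gauss(0.0, 1.0) for _ in range(d)]

m, p = sketch_mask(X, w, s, rng)
exact = matvec(X, w)
approx = matvec(X, [m[i] * w[i] for i in range(d)])  # X^T (m * w)
nonzeros = sum(1 for v in m if v != 0.0)
print("nonzero mask entries:", nonzeros, "out of", d)
```

The mask has at most `s` nonzero entries, so `m * w` is a sparse weight vector. A single draw can be far from the exact product, but averaging over many independent masks approaches `exact`, which is what unbiasedness means here.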

Paper Structure (17 sections, 8 theorems, 42 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Sketching and Pruning at Initialization
Pruning Approximation Error
Using Sketching With Existing Pruning Methods
SynFlow and Sketching
SNIP and Sketching
Neural Tangent Kernel Pruning
Experiments
Conclusion and Future Work
Results with Layerwise Sparsity
Useful Lemmas and Proofs
Proof of \ref{lemma:error_x}
Proof of \ref{lemma:sketching_rand_data}
Proof of \ref{thm:2weight_rand_data}
...and 2 more sections

Key Result

Lemma 4.1

(lemma:sketching; drineas2006fast) Suppose $X\in\mathbb{R}^{d\times n}$, $w^0\in\mathbb{R}^d$, and $s\in\mathbb{Z}^+$. Then, using the algorithm algo:sketch_mask with $p^0$ to obtain $m$, the error is

Figures (4)

Figure 1: Pruning a fully connected NN on Fashion-MNIST: (left) Norms of winning lottery tickets and random masks at multiple sparsities. (right) Winning-ticket norms vs. 10,000 random masks with 1.2% density.
Figure 2: Histogram of scores of NN before pruning and the weights chosen by SynFlow with/without randomization.
Figure 3: Weight histograms of sparse subnetworks at initialization for VGG-19 on CIFAR-10 with 2% (left) and 5% (right) remaining weights. SynFlow and IMP are biased toward large-magnitude weights compared to a uniformly random mask.
Figure: Sketching for mask [drineas2006fast].

Theorems & Definitions (16)

Lemma 4.1
Lemma 4.2
Lemma 4.3
Theorem 4.4
Lemma 4.5
Definition 6.1: Local Lipschitzness of the Jacobian
Theorem 6.2
Lemma B.1
Proof of \ref{lemma:expectation}
Lemma B.2
...and 6 more
