Sparsest Models Elude Pruning: An Exposé of Pruning's Current Capabilities

Stephen Zhang; Vardan Papyan

TL;DR

The paper investigates whether state-of-the-art pruning methods can recover the sparsest subnetworks that still achieve a target accuracy. It introduces the Cubist Spiral dataset and a two-phase combinatorial search that first enforces structured sparsity and then unstructured sparsity. The search establishes a reference level of sparsity against which pruning methods are benchmarked. The results show a substantial gap: sparse models with as few as $30$–$45$ nonzeros can reach high accuracy, while leading pruning methods require many more nonzeros and often create disconnected paths, even when given the optimal initialization and width. Overparameterization tends to hinder pruning, and pruning after training does not reach the sparsity identified by the combinatorial search. These findings challenge the current pruning paradigm and motivate new approaches that better preserve connectivity and leverage structured sparsity.

Abstract

Pruning has emerged as a promising approach for compressing large-scale models, yet its effectiveness in recovering the sparsest of models has not yet been explored. We conducted an extensive series of 485,838 experiments, applying a range of state-of-the-art pruning algorithms to a synthetic dataset we created, named the Cubist Spiral. Our findings reveal a significant gap in performance compared to ideal sparse networks, which we identified through a novel combinatorial search algorithm. We attribute this performance gap to current pruning algorithms' poor behaviour under overparameterization, their tendency to induce disconnected paths throughout the network, and their propensity to get stuck at suboptimal solutions, even when given the optimal width and initialization. This gap is concerning, given the simplicity of the network architectures and datasets used in our study. We hope that our research encourages further investigation into new pruning techniques that strive for true network sparsity.
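One of the failure modes named above, disconnected paths, can be made concrete with a small check. The following is a minimal sketch and not the authors' code: it assumes each layer's mask is a 0/1 matrix indexed as `mask[output_unit][input_unit]`, it ignores biases, and the function name `live_nonzeros` is hypothetical. It counts how many kept weights lie on at least one path from an input to an output.

```python
def live_nonzeros(masks):
    """Return (total kept weights, kept weights on some input-to-output path).

    Assumption: masks[l][r][c] == 1 means the weight from unit c (input side
    of layer l) to unit r (output side of layer l) is kept. Biases are ignored.
    """
    n_in = len(masks[0][0])
    n_out = len(masks[-1])

    # Forward sweep: reach[l][c] is True if unit c feeding layer l
    # is connected to at least one network input.
    reach = [[True] * n_in]
    for m in masks:
        prev = reach[-1]
        reach.append([any(v and prev[c] for c, v in enumerate(row)) for row in m])

    # Backward sweep: useful[l][c] is True if unit c feeding layer l
    # is connected to at least one network output.
    useful = [[True] * n_out]
    for m in reversed(masks):
        nxt = useful[0]
        n_cols = len(m[0])
        useful.insert(
            0, [any(m[r][c] and nxt[r] for r in range(len(m))) for c in range(n_cols)]
        )

    total = sum(v for m in masks for row in m for v in row)
    live = sum(
        1
        for l, m in enumerate(masks)
        for r, row in enumerate(m)
        for c, v in enumerate(row)
        if v and reach[l][c] and useful[l + 1][r]
    )
    return total, live


# Toy example: 2 inputs -> 3 hidden units -> 1 output.
toy_masks = [
    [[1, 0], [0, 0], [0, 1]],  # hidden unit 1 has no incoming weight
    [[1, 1, 0]],               # hidden unit 2 has no outgoing weight
]
total, live = live_nonzeros(toy_masks)
```

In this toy case two of the four kept weights are wasted: one leaves a hidden unit that receives no input, and one feeds a hidden unit that never reaches the output. A pruned model in this state spends part of its nonzero budget on weights that cannot affect the prediction.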

Paper Structure (64 sections, 1 theorem, 9 equations, 24 figures, 1 table)

Introduction
Method Overview
Contributions
Background: Pruning Algorithms
Dense to Sparse
Pruning at Initialization
Sparse to Sparse
Methodology
Network
Dataset
Combinatorial Search
First Phase: Structured Sparsity
Second Phase: Unstructured Sparsity
Selection of Pruning Algorithms
Initialization Experiments
...and 49 more sections

Key Result

Theorem 12.1

Consider an $L$-layer multilayer perceptron with weights ${\bm{W}}^{[1]}, ..., {\bm{W}}^{[L]}$ where ${\bm{W}}^{[1]}\in \mathbb{R}^{d\times w}, {\bm{W}}^{[2]}, ..., {\bm{W}}^{[L-1]}\in \mathbb{R}^{w \times w}, {\bm{W}}^{[L]} \in \mathbb{R}^{w \times C}$. Suppose the model is randomly pruned such tha

Figures (24)

Figure 1: Sparse Model Visualization. Visualization of a sparse model, discovered through our combinatorial search algorithm, trained on the Cubist Spiral dataset. The first two squares on the left denote the input variables, while the final, larger square depicts the output from the classifier. The intermediate squares reveal post-activation states which are connected by edges, corresponding to entries in weight matrices. At the top of each square, there is a tiny square that is colored according to the bias of the corresponding neuron. Blue is used to represent a positive value, orange a negative value, and white -- a value of zero.
Figure 2: Comparative view of spiral datasets.
Figure 3: First Phase. Two structured sparsity masks that would be tested by the first phase of the combinatorial search. It is always the first $d^{[\ell-1]}$ columns and $d^{[\ell]}$ rows that are nonzero inside the masks, denoted by the light red squares. If both sets of masks reach the target accuracy, the set of masks on the right will be utilized by the second phase as it contains fewer nonzeros.
Figure 4: Second Phase. A schematic illustration of the second phase of the combinatorial search. Left: A set of unstructured sparsity masks that would be tested in the second phase, generated by utilizing the minimal structured sparsity masks found in the first phase. The dark red squares denote the nonzero entries in the unstructured sparsity masks and the light red squares denote the nonzero entries of the minimal structured sparsity masks found in the first phase. Right: An ineligible mask containing rows and columns without at least one nonzero element, which does not fully utilize the minimal number of neurons identified in the first phase.
Figure 5: Phase One of Combinatorial Search. Scatter plot with each point corresponding to a different model that was trained with a different structured mask. Two models are highlighted -- the sparsest achieving above 95% accuracy (where the number of neurons in each layer is 3,3,3) and the sparsest achieving above 99.5% accuracy (where the number of neurons in each layer is 7,3,3) -- accompanied by their corresponding reconstructions of the spiral.
...and 19 more figures
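The mask shapes described in the captions of Figures 3 and 4 can be illustrated in a few lines. This is a rough sketch under stated assumptions, not the paper's search procedure: masks are 0/1 matrices indexed as `mask[row][column]`, the helper names (`structured_mask`, `is_eligible`, `eligible_masks`) are hypothetical, and the training and accuracy check applied to each candidate mask are omitted.

```python
from itertools import combinations


def structured_mask(n_rows, n_cols, d_out, d_in):
    """First-phase mask: ones in the first d_out rows and first d_in columns."""
    return [
        [1 if r < d_out and c < d_in else 0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


def is_eligible(mask, d_out, d_in):
    """Second-phase eligibility: every row and column of the structured block
    keeps at least one nonzero, so no neuron found in phase one is dropped."""
    rows_ok = all(any(mask[r][c] for c in range(d_in)) for r in range(d_out))
    cols_ok = all(any(mask[r][c] for r in range(d_out)) for c in range(d_in))
    return rows_ok and cols_ok


def eligible_masks(n_rows, n_cols, d_out, d_in, nnz):
    """Yield every eligible unstructured mask with exactly nnz nonzeros,
    all placed inside the d_out x d_in structured block."""
    cells = [(r, c) for r in range(d_out) for c in range(d_in)]
    for chosen in combinations(cells, nnz):
        mask = [[0] * n_cols for _ in range(n_rows)]
        for r, c in chosen:
            mask[r][c] = 1
        if is_eligible(mask, d_out, d_in):
            yield mask


block = structured_mask(4, 4, d_out=2, d_in=3)
two_nonzero_masks = list(eligible_masks(4, 4, d_out=2, d_in=2, nnz=2))
```

For a 2×2 block with two nonzeros, only the two diagonal placements pass the eligibility check; the other four placements leave a row or a column empty, which corresponds to the ineligible mask shown on the right of Figure 4.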

Theorems & Definitions (1)

Theorem 12.1
