Sparsest Models Elude Pruning: An Exposé of Pruning's Current Capabilities
Stephen Zhang, Vardan Papyan
TL;DR
The paper investigates whether state-of-the-art pruning methods can recover the sparsest subnetworks that still achieve a target accuracy. It introduces the Cubist Spiral dataset and a two-phase combinatorial search that first enforces structured sparsity and then unstructured sparsity. The search identifies the sparsest networks that reach the target accuracy, and pruning methods are benchmarked against this reference. The results show a substantial gap: sparse models with as few as $30$–$45$ nonzeros can reach high accuracy, while leading pruning methods require many more nonzeros and often create disconnected paths, even when given the optimal initialization and width. Overparameterization tends to hinder pruning, and pruning after training does not reach the minimal sparsity found by the combinatorial search. These findings challenge the current pruning paradigm and motivate new approaches that better preserve connectivity and leverage structured sparsity.
Abstract
Pruning has emerged as a promising approach for compressing large-scale models, yet its effectiveness in recovering the sparsest of models has not yet been explored. We conducted an extensive series of 485,838 experiments, applying a range of state-of-the-art pruning algorithms to a synthetic dataset we created, named the Cubist Spiral. Our findings reveal a significant gap in performance compared to ideal sparse networks, which we identified through a novel combinatorial search algorithm. We attribute this performance gap to current pruning algorithms' poor behaviour under overparameterization, their tendency to induce disconnected paths throughout the network, and their propensity to get stuck at suboptimal solutions, even when given the optimal width and initialization. This gap is concerning, given the simplicity of the network architectures and datasets used in our study. We hope that our research encourages further investigation into new pruning techniques that strive for true network sparsity.
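To make the notion of disconnected paths concrete, here is a minimal sketch of one way to count how many of a pruned network's nonzero weights actually lie on some input-to-output path. It is an illustration and not the authors' code: the function name `effective_nonzeros`, the mask layout (one 0/1 array per layer, shaped `(fan_out, fan_in)`), and the choice to ignore biases are all assumptions made here.

```python
import numpy as np


def effective_nonzeros(masks):
    """Return (live, total) nonzero counts for a masked feed-forward network.

    masks: list of 0/1 arrays; masks[l] has shape (fan_out, fan_in).
    A weight is "live" if it lies on at least one input-to-output path.
    Assumption: biases are ignored, so a unit with no incoming weights
    is treated as dead.
    """
    masks = [np.asarray(m, dtype=bool) for m in masks]

    # Forward sweep: which units are reachable from some input.
    fwd = [np.ones(masks[0].shape[1], dtype=bool)]
    for m in masks:
        fwd.append((m & fwd[-1][None, :]).any(axis=1))

    # Backward sweep: which units can reach some output.
    bwd = [np.ones(masks[-1].shape[0], dtype=bool)]
    for m in reversed(masks):
        bwd.append((m & bwd[-1][:, None]).any(axis=0))
    bwd = bwd[::-1]

    # A weight is live if its source is reachable and its target reaches an output.
    live = sum(
        int((m & fwd[l][None, :] & bwd[l + 1][:, None]).sum())
        for l, m in enumerate(masks)
    )
    total = sum(int(m.sum()) for m in masks)
    return live, total
```

A gap between the two returned counts means some retained weights contribute nothing to the output, so the nominal number of nonzeros overstates the network's usable capacity.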
