Weight Variance Amplifier Improves Accuracy in High-Sparsity One-Shot Pruning
Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee
TL;DR
This work introduces the Variance Amplifying Regularizer (VAR), a lightweight training regularizer that enlarges weight variance to improve the robustness of neural networks to one-shot pruning at very high sparsity. VAR integrates with standard SGD and can complement pruning-robust optimizers such as SAM or CrAM, promoting a broader weight distribution with more near-zero values. The authors prove that VAR preserves SGD convergence under milder smoothness assumptions and demonstrate, across CNNs and Vision Transformers on classification and medical segmentation tasks, that VAR consistently preserves accuracy after aggressive pruning and yields flatter loss landscapes. Overall, VAR offers a computationally efficient way to improve structural robustness and pruning resilience in both convolutional and transformer models, with practical benefits for deployment on resource-constrained devices.
Abstract
Deep neural networks achieve outstanding performance in visual recognition tasks, yet their large number of parameters makes them less practical for real-world applications. Recently, one-shot pruning has emerged as an effective strategy for reducing model size without additional training. However, models trained with standard objective functions often suffer a significant drop in accuracy after aggressive pruning. Some existing pruning-robust optimizers, such as SAM and CrAM, mitigate this accuracy drop by guiding the model toward flatter regions of the parameter space, but they inevitably incur non-negligible additional computation. We propose a Variance Amplifying Regularizer (VAR) that deliberately increases the variance of model parameters during training. Our study reveals an intriguing finding: parameters with higher variance exhibit greater pruning robustness. VAR exploits this property by promoting such variance in the weight distribution, thereby mitigating the adverse effects of pruning. We further provide a theoretical analysis of its convergence behavior, supported by extensive empirical results demonstrating the superior pruning robustness of VAR.
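To make the two ingredients in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: this section does not give the exact form of VAR, so `variance_penalty` simply uses the negative population variance of the weights scaled by a hypothetical coefficient `lam`, and `magnitude_prune` uses global magnitude pruning as one common form of one-shot pruning. Both function names and the toy weight vector are made up for the example.

```python
import statistics


def magnitude_prune(weights, sparsity):
    """One-shot magnitude pruning: zero the `sparsity` fraction of
    weights with the smallest absolute value, with no retraining."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def variance_penalty(weights, lam=1e-3):
    """Assumed regularizer form (not taken from the paper): the negative
    population variance scaled by `lam`. Adding this term to a training
    loss rewards a wider weight distribution."""
    return -lam * statistics.pvariance(weights)


# Toy weight vector, purely illustrative.
weights = [0.5, -0.1, 0.02, -0.9, 0.3, 0.0, 1.2, -0.04]

# Prune half of the weights in one shot.
pruned = magnitude_prune(weights, sparsity=0.5)

# Regularization term that would be added to the task loss.
reg_term = variance_penalty(weights, lam=1e-3)
```

In a training loop, a term like `reg_term` would be added to the task loss at each SGD step, and `magnitude_prune` would be applied once after training. The intuition matching the abstract is that a wider distribution with many near-zero values loses little when those small weights are removed.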
