Weight Variance Amplifier Improves Accuracy in High-Sparsity One-Shot Pruning

Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee

TL;DR

This work introduces Variance Amplifying Regularizer (VAR), a lightweight training regularizer that enlarges weight variance to improve robustness of neural networks under very high sparsity from one-shot pruning. VAR integrates with standard SGD and can complement pruning-robust optimizers like SAM or CrAM, promoting a broader weight distribution with more near-zero values. The authors prove that VAR preserves SGD convergence under milder smoothness assumptions and demonstrate, across CNNs and Vision Transformers on classification and medical segmentation tasks, that VAR consistently preserves accuracy after aggressive pruning and yields flatter loss landscapes. Collectively, VAR offers a computationally efficient strategy to enhance structural robustness and pruning resilience in both convolutional and transformer models, with practical benefits for deployment on resource-constrained devices.

Abstract

Deep neural networks achieve outstanding performance in visual recognition tasks, yet their large number of parameters makes them less practical for real-world applications. Recently, one-shot pruning has emerged as an effective strategy for reducing model size without additional training. However, models trained with standard objective functions often suffer a significant drop in accuracy after aggressive pruning. Some existing pruning-robust optimizers, such as SAM and CrAM, mitigate this accuracy drop by guiding the model toward flatter regions of the parameter space, but they inevitably incur non-negligible additional computations. We propose a Variance Amplifying Regularizer (VAR) that deliberately increases the variance of model parameters during training. Our study reveals an intriguing finding that parameters with higher variance exhibit greater pruning robustness. VAR exploits this property by promoting such variance in the weight distribution, thereby mitigating the adverse effects of pruning. We further provide a theoretical analysis of its convergence behavior, supported by extensive empirical results demonstrating the superior pruning robustness of VAR.
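The two ingredients named in the abstract, a penalty that rewards weight variance and one-shot pruning, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the excerpt does not give the exact form of the regularizer, so the sketch assumes $\psi(w) = -\mathrm{Var}(w)$, and it assumes pruning by global weight magnitude. The function names `variance_penalty`, `variance_penalty_grad`, and `magnitude_prune` are hypothetical and do not come from the paper.

```python
import numpy as np


def variance_penalty(w: np.ndarray) -> float:
    """Assumed psi(w) = -Var(w); minimizing it pushes the variance up."""
    return -float(np.var(w))


def variance_penalty_grad(w: np.ndarray) -> np.ndarray:
    """Gradient of -Var(w) with respect to w: -(2 / n) * (w - mean(w))."""
    return -2.0 * (w - w.mean()) / w.size


def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot pruning sketch: zero the `sparsity` fraction of entries
    with the smallest absolute value, with no retraining afterwards."""
    k = int(round(sparsity * w.size))
    pruned = w.copy().ravel()
    if k > 0:
        # Indices of the k smallest-magnitude weights (order among them is irrelevant).
        idx = np.argpartition(np.abs(pruned), k - 1)[:k]
        pruned[idx] = 0.0
    return pruned.reshape(w.shape)
```

In a training loop the penalty gradient would be scaled by $\lambda$ and added to the task-loss gradient. Taking one step on the penalty alone, `w - lr * lam * variance_penalty_grad(w)`, moves every weight away from the mean and therefore raises `np.var(w)`; `magnitude_prune(w, 0.9)` then keeps only the 10% largest-magnitude weights.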

Paper Structure

This paper contains 26 sections, 8 theorems, 42 equations, 6 figures, 11 tables, 1 algorithm.

Key Result

Lemma 4

Under the $\beta_{1}$-smoothness of $L$ and the $\beta_{2}$-smoothness of $\psi$, the combined objective $L_{\mathrm{total}}(w)=L(w)+\lambda\psi(w)$ is $\beta$-smooth for $\beta \le \beta_{1}+\lambda\beta_{2}$. Together with the bounded-variance conditions $\mathbb{E}[\nabla L_t(w_t)] = \nabla L(w_t)$ ...
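The smoothness constant in the lemma follows from the triangle inequality. A short derivation, assuming that $\beta$-smoothness means the gradient is $\beta$-Lipschitz and that $\lambda \ge 0$ (the standard reading; the excerpt does not restate the definition), for arbitrary points $u$ and $v$:

```latex
\begin{aligned}
\|\nabla L_{\mathrm{total}}(u) - \nabla L_{\mathrm{total}}(v)\|
  &= \|\nabla L(u) - \nabla L(v) + \lambda\,(\nabla\psi(u) - \nabla\psi(v))\| \\
  &\le \|\nabla L(u) - \nabla L(v)\| + \lambda\,\|\nabla\psi(u) - \nabla\psi(v)\| \\
  &\le \beta_{1}\,\|u - v\| + \lambda\beta_{2}\,\|u - v\| \\
  &= (\beta_{1} + \lambda\beta_{2})\,\|u - v\|.
\end{aligned}
```

So $\beta_{1}+\lambda\beta_{2}$ is a valid Lipschitz constant for $\nabla L_{\mathrm{total}}$, and the smallest such constant $\beta$ can only be less than or equal to it.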

Figures (6)

  • Figure 1: Weight parameters' distribution comparison of models trained with standard SGD (blue) and SGD with the proposed Variance Amplifying Regularizer (red, $\lambda$ = 1e-5). (A) ResNet-18 on CIFAR-10, (B) ResNet-50 on SVHN, and (C) WideResNet-28-10 on CIFAR-100. Across all architectures, applying Variance Amplifying Regularizer (VAR) under the same SGD training setting broadens the weight distribution.
  • Figure 2: Qualitative segmentation results on the LGG MRI dataset using the ResNet-50–UNet architecture under 85% pruning. Each column corresponds to a different method, shown with and without our proposed Variance Amplifying Regularizer (VAR).
  • Figure 3: Comparison of SGD and SAM with and without the proposed Variance Amplifying Regularizer (VAR) on CIFAR-10 using ResNet-18 (training batch size 1024, trained for 300 epochs). (A,C) show test accuracy over training epochs; (B,D) show the largest Hessian eigenvalue, which measures the maximum curvature of the loss surface. For (B,D), both raw traces (faint) and EMA-smoothed curves (bold) are presented, where the EMA is an exponential moving average applied across epochs to reduce stochastic noise.
  • Figure 4: Qualitative segmentation results on the LGG MRI dataset using the ResNet-50–UNet architecture under 85% pruning. Each column displays the outputs of different pruning-robust training methods, shown both with and without our proposed Variance Amplifying Regularizer (VAR), along with the corresponding ground-truth mask.
  • Figure 5: Qualitative segmentation results on the LGG MRI dataset using the ResNet-50–UNet architecture under 85% pruning. Each column displays the outputs of different pruning-robust training methods, shown both with and without our proposed Variance Amplifying Regularizer (VAR), along with the corresponding ground-truth mask.
  • ...and 1 more figure
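The two quantities plotted in Figure 3 (B,D), the largest Hessian eigenvalue and its EMA-smoothed trace, can be sketched as follows. This is a minimal illustration, not the authors' measurement code: `top_eigenvalue` runs power iteration on an explicit symmetric matrix (for a real network one would typically use Hessian-vector products instead), and the smoothing factor `alpha` in `ema_smooth` is an assumed value, since the caption does not state one. Both function names are hypothetical.

```python
import numpy as np


def top_eigenvalue(hessian: np.ndarray, iters: int = 200, seed: int = 0) -> float:
    """Power iteration on a symmetric matrix.

    Returns an estimate of the eigenvalue with the largest magnitude,
    which equals the maximum curvature when that eigenvalue is positive.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hessian @ v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            return 0.0
        v = hv / norm
    # Rayleigh quotient of the converged unit vector.
    return float(v @ hessian @ v)


def ema_smooth(values, alpha: float = 0.1):
    """Exponential moving average: s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * float(x) + (1.0 - alpha) * smoothed[-1])
    return smoothed
```

Applied per epoch, `top_eigenvalue` would give the raw (faint) trace and `ema_smooth` the bold curve; a smaller `alpha` smooths more aggressively but reacts more slowly to real changes in curvature.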

Theorems & Definitions (12)

  • Lemma 4
  • Theorem 5
  • Corollary 6
  • Corollary 7: Diminishing step size
  • Lemma 4
  • proof
  • Theorem 5
  • proof
  • Corollary 6
  • proof
  • ...and 2 more