SWAP: Sparse Entropic Wasserstein Regression for Robust Network Pruning

Lei You, Hei Victor Cheng

TL;DR

SWAP reframes network pruning under noisy gradients as Sparse Entropic Wasserstein regression (EWR), leveraging entropically regularized optimal transport to interpolate among gradient neighborhoods. By minimizing the Wasserstein distance between projected gradients before and after pruning, plus a sparsity constraint and a small quadratic penalty, SWAP achieves robustness to noise while preserving useful covariance information. Theoretical results (via convex hull and neighborhood interpolation) and entropic regularization improve sample efficiency and stabilize pruning decisions; empirically, SWAP matches or surpasses SoTA pruning methods, with pronounced gains at high sparsity and in the presence of gradient noise, as demonstrated on MLPNet, ResNet, and MobileNetV1 across multiple datasets. The approach offers a scalable, robust alternative for large-scale model compression, with practical impact for deploying efficient, resilient neural networks in resource-constrained environments.

Abstract

This study addresses the challenge of inaccurate gradients in computing the empirical Fisher Information Matrix during neural network pruning. We introduce SWAP, a formulation of Entropic Wasserstein regression (EWR) for pruning that capitalizes on the geometric properties of the optimal transport problem. The "swap" of the commonly used linear regression for EWR in the optimization is shown analytically to mitigate noise by incorporating neighborhood interpolation across data points, at only marginal additional computational cost. The unique strength of SWAP is its intrinsic ability to balance noise reduction against the preservation of covariance information. Extensive experiments on various networks and datasets show that SWAP performs comparably to state-of-the-art (SoTA) network pruning algorithms. It outperforms the SoTA when the network size or the target sparsity is large, and the gain is even larger in the presence of noisy gradients, which may arise from noisy data, analog memory, or adversarial attacks. Notably, SWAP achieves a 6% improvement in accuracy and an 8% improvement in testing loss for MobileNetV1 with less than one-fourth of the network parameters remaining.
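To make the idea of entropically regularized optimal transport and neighborhood interpolation more concrete, here is a minimal NumPy sketch of generic Sinkhorn-Knopp iterations on a toy regression problem. It is not the authors' implementation: the function name `sinkhorn`, the toy data, the cost normalization, and the value of `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Generic Sinkhorn-Knopp iterations for entropic optimal transport.

    Returns a coupling whose row sums approximate `a` and whose column sums
    match `b`. Illustrative sketch only, not the paper's algorithm.
    """
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                # rescale rows
        v = b / (K.T @ u)              # rescale columns
    return u[:, None] * K * v[None, :]

# Hypothetical toy data: noisy targets y and predictions of a linear model.
rng = np.random.default_rng(0)
n = 6
x = rng.normal(size=(n, 1))
y = 2.0 * x + 0.1 * rng.normal(size=(n, 1))
pred = 2.0 * x

# Pairwise squared distances between predictions and targets, scaled to [0, 1]
# so that exp(-cost / eps) stays well conditioned.
cost = (pred - y.T) ** 2
cost = cost / cost.max()

a = np.full(n, 1.0 / n)                # uniform weights on predictions
b = np.full(n, 1.0 / n)                # uniform weights on targets
plan = sinkhorn(cost, a, b)

# Transport cost under the entropic coupling versus the identity coupling.
# The identity coupling pairs each prediction with its own target, which is
# the ordinary least-squares loss on the scaled cost.
ewr_like_loss = float(np.sum(plan * cost))
lr_like_loss = float(np.mean(np.diag(cost)))

# Neighborhood interpolation: each prediction is compared with a convex
# combination of targets, weighted by its row of the coupling.
y_bar = (plan @ y) / plan.sum(axis=1, keepdims=True)
```

In this sketch, a smaller `eps` concentrates each row of `plan` on the cheapest targets, while a larger `eps` spreads the mass across more neighbors, so each interpolated target in `y_bar` averages over a wider neighborhood.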

Paper Structure (16 sections, 1 theorem, 33 equations, 11 figures, 7 tables, 3 algorithms)

This paper contains 16 sections, 1 theorem, 33 equations, 11 figures, 7 tables, 3 algorithms.

Key Result

Proposition 1

Consider a set $S$ and its convex hull $\textit{Conv}(S)$ in a Euclidean space, and an arbitrary point ${x}$ in the space. For any probability measure $\hat{\nu}$ on $S$, we can find a point ${y}'$ in $\textit{Conv}(S)$ as ${y}' = \int {y} \, \mathrm{d}\hat{\nu}({y})$ such that $\|{x} - {y}'\|^2 = \int \| \ldots$
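
The statement of the proposition is cut off in this listing. As background only, and not as a reconstruction of the authors' exact wording, the standard bias-variance identity relates the same quantities. It assumes that $\hat{\nu}$ has a finite second moment and uses the fact that $\int ({y}' - {y}) \, \mathrm{d}\hat{\nu}({y}) = 0$ when ${y}'$ is the mean of $\hat{\nu}$:

```latex
\begin{aligned}
\int \|x - y\|^2 \, \mathrm{d}\hat{\nu}(y)
  &= \int \|(x - y') + (y' - y)\|^2 \, \mathrm{d}\hat{\nu}(y) \\
  &= \|x - y'\|^2
     + 2 \Big\langle x - y', \int (y' - y) \, \mathrm{d}\hat{\nu}(y) \Big\rangle
     + \int \|y' - y\|^2 \, \mathrm{d}\hat{\nu}(y) \\
  &= \|x - y'\|^2 + \int \|y' - y\|^2 \, \mathrm{d}\hat{\nu}(y).
\end{aligned}
```

Read this way, the squared distance from $x$ to the averaged point $y'$ equals the average squared distance to the points of $S$ minus the spread of $\hat{\nu}$ around its mean. This is one way to see why comparing against interpolated points can reduce the influence of noise.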

Figures (11)

  • Figure 1: Comparison between the Sinkhorn-Knopp algorithm and the closed-form solution. The plot is based on data from ResNet20 trained on CIFAR-10. The relative difference is computed as (red - blue) / blue.
  • Figure 2: Loss reduction for different values of $\varepsilon$. The result is averaged over 25 runs for ResNet20, with 10% noisy data and noise level $\sigma$. The error bars show the 90% confidence interval. The target sparsity is 0.95.
  • Figure 3: EWR loss as a function of sparsity. The result is obtained over 25 runs on ResNet20, with 10% noisy data and noise level $\sigma$. The target sparsity is 0.95.
  • Figure 4: Difference in loss between LR and EWR for ResNet20. The data is from the ResNet20 loss table. The relative loss improvement of EWR over LR is reported. The target sparsity is 0.95.
  • Figure 5: Difference in loss between LR and EWR for MobileNetV1. The data is from the MobileNetV1 loss table. The relative loss improvement of EWR over LR is reported. The target sparsity is 0.75.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Proposition 1: Convex Hull Distance Equality
  • proof