Compression-aware Training of Neural Networks using Frank-Wolfe

Max Zimmer; Christoph Spiegel; Sebastian Pokutta

TL;DR

This work tackles training neural networks that remain accurate under compression (pruning and low-rank decomposition) without retraining. It introduces a compression-aware framework built on Stochastic Frank-Wolfe with norm-constrained feasible regions, notably the group-$k$-support and spectral-$k$-support norms, to drive structured sparsity and low-rankness during training. A gradient-rescaled learning rate is shown to be crucial for convergence and pruning stability, with a theoretical convergence result in the non-convex stochastic setting. Empirically, SparseFW achieves competitive or superior performance across image classification and semantic segmentation tasks and offers efficiency gains over nuclear-norm based methods, indicating practical impact for deployable, compression-tolerant models.

Abstract

Many existing Neural Network pruning approaches rely on either retraining or inducing a strong bias in order to converge to a sparse solution throughout training. A third paradigm, 'compression-aware' training, aims to obtain state-of-the-art dense models that are robust to a wide range of compression ratios using a single dense training run while also avoiding retraining. We propose a framework centered around a versatile family of norm constraints and the Stochastic Frank-Wolfe (SFW) algorithm that encourages convergence to well-performing solutions while inducing robustness towards convolutional filter pruning and low-rank matrix decomposition. Our method is able to outperform existing compression-aware approaches and, in the case of low-rank matrix decomposition, it also requires significantly fewer computational resources than approaches based on nuclear-norm regularization. Our findings indicate that dynamically adjusting the learning rate of SFW, as suggested by Pokutta et al. (2020), is crucial for convergence and robustness of SFW-trained models, and we establish a theoretical foundation for that practice.
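To make the abstract's ingredients concrete, the following is a minimal NumPy sketch of a single SFW step over a group-$k$-support norm ball, not the authors' implementation. It assumes the weights are stored as a 2D array whose rows are the groups (for example, flattened convolutional filters), that the feasible region is the convex hull of points with at most $k$ non-zero groups and Euclidean norm at most $\tau$, and that the rescaled learning rate takes the form $\min\{\eta \Vert \nabla_t \Vert_2 / \Vert v_t - \theta_t \Vert_2, 1\}$. The function names `group_k_support_lmo` and `sfw_step` are hypothetical, and momentum is omitted.

```python
import numpy as np

def group_k_support_lmo(grad, k, tau):
    """Assumed LMO: keep the k rows (groups) of the gradient with the
    largest L2 norm and scale the result to Euclidean norm tau."""
    assert 1 <= k <= grad.shape[0]
    row_norms = np.linalg.norm(grad, axis=1)
    top = np.argsort(row_norms)[-k:]      # indices of the k largest groups
    v = np.zeros_like(grad)
    v[top] = grad[top]
    norm = np.linalg.norm(v)
    if norm == 0.0:                       # zero gradient: any feasible point works
        return v
    return -tau * v / norm

def sfw_step(theta, grad, k, tau, lr):
    """One Frank-Wolfe step with the (assumed) gradient-rescaled learning rate."""
    v = group_k_support_lmo(grad, k, tau)
    dist = np.linalg.norm(v - theta)
    # Rescale lr by ||grad|| / ||v - theta|| and clip to 1 to stay feasible.
    eta = 1.0 if dist == 0.0 else min(lr * np.linalg.norm(grad) / dist, 1.0)
    # Convex combination of the iterate and the LMO solution.
    return (1.0 - eta) * theta + eta * v, eta
```

Because each update is a convex combination of the current iterate and a feasible point, the iterates stay inside the feasible region whenever the initial weights do, so no projection step is needed.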

Paper Structure (39 sections, 6 theorems, 37 equations, 8 figures, 11 tables, 1 algorithm)

Introduction
Contributions.
Related Work.
Outline.
Preliminaries
Constrained optimization using Stochastic Frank-Wolfe.
Inducing structure through the feasible region.
Methodology: Compression-aware Training
Inducing group sparsity to Neural Networks
Unstructured sparsity as a special case
A different sparsity notion: pruning singular values
Experimental Results
Compression awareness: Structured Filter Pruning
Compression awareness: Low-Rank Decomposition
The impact of the learning rate schedule
...and 24 more sections

Key Result

Lemma 3.1

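Let $\mathcal{W}_t = -\tau{\Vert \sigma(\Sigma_k) \Vert }_2^{-1} U_k \Sigma_k V_k^T \in \mathcal{C}^{\sigma}_k(\tau)$, where $U_k \Sigma_k V_k^T$ is the truncated $k$-SVD of $\nabla_t$ such that only the $k$ largest singular values are kept. Then $\mathcal{W}_t$ is a solution to the linear minimization oracle problem of Frank-Wolfe (eq:LMO), that is, it minimizes $\langle \mathcal{W}, \nabla_t \rangle$ over $\mathcal{W} \in \mathcal{C}^{\sigma}_k(\tau)$.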
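The closed form in Lemma 3.1 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name `spectral_k_support_lmo` is hypothetical, the gradient is assumed to be a single 2D matrix, and $\sigma(\Sigma_k)$ is read as the vector of the $k$ retained singular values.

```python
import numpy as np

def spectral_k_support_lmo(grad, k, tau):
    """Return -tau * U_k Sigma_k V_k^T / ||sigma(Sigma_k)||_2 as in Lemma 3.1."""
    # np.linalg.svd returns singular values in descending order.
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    k = min(k, s.size)
    s_k = s[:k]                            # the k largest singular values
    norm = np.linalg.norm(s_k)
    if norm == 0.0:                        # zero gradient: return a feasible point
        return np.zeros_like(grad)
    # (U[:, :k] * s_k) scales column i of U_k by singular value i.
    return -tau * (U[:, :k] * s_k) @ Vt[:k, :] / norm
```

The result has rank at most $k$ and Frobenius norm $\tau$, and its inner product with the gradient equals $-\tau \Vert \sigma(\Sigma_k) \Vert_2$.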

Figures (8)

Figure 1: ResNet-18 on CIFAR-10: Relative distance to filter pruned model corresponding to 70% sparsity when training with the proposed approach and varying $k$.
Figure 2: Accuracy-vs.-sparsity tradeoff curves for structured convolutional filter pruning on ImageNet. The plots show the parameter configuration with the highest test accuracy after pruning, averaged over all considered sparsities.
Figure 3: Test performance-vs.-sparsity tradeoff curves for low-rank tensor decomposition on CIFAR-100 (left), CityScapes (middle) and TinyImageNet (right).
Figure 4: ResNet-18 on CIFAR-10: For each pruning amount, the best hyperparameter configuration w.r.t. the accuracy after pruning (pruned) is depicted. The corresponding value before pruning (dense) is depicted as a dashed line.
Figure 5: ResNet-18 on CIFAR-10: Accuracy-vs.-sparsity tradeoff curves for unstructured weight pruning comparing our approach to the existing $k$-sparse approach.
...and 3 more figures

Theorems & Definitions (9)

Lemma 3.1
Theorem 4.1: Convergence of gradient rescaling, informal
Lemma 1.1
proof
Lemma 1.2
proof
Lemma 1.3
Theorem 1.4
proof
