Shaving Weights with Occam's Razor: Bayesian Sparsification for Neural Networks Using the Marginal Likelihood

Rayen Dhahri; Alexander Immer; Bertrand Charpentier; Stephan Günnemann; Vincent Fortuin

TL;DR

This work presents Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable.

Abstract

Neural network sparsification is a promising avenue to save computational time and memory costs, especially in an age where many successful AI models are becoming too large to naïvely deploy on consumer hardware. While much work has focused on different weight pruning criteria, the overall sparsifiability of the network, i.e., its capacity to be pruned without quality loss, has often been overlooked. We present Sparsifiability via the Marginal likelihood (SpaM), a pruning framework that highlights the effectiveness of using the Bayesian marginal likelihood in conjunction with sparsity-inducing priors for making neural networks more sparsifiable. Our approach implements an automatic Occam's razor that selects the most sparsifiable model that still explains the data well, both for structured and unstructured sparsification. In addition, we demonstrate that the pre-computed posterior Hessian approximation used in the Laplace approximation can be re-used to define a cheap pruning criterion, which outperforms many existing (more expensive) approaches. We demonstrate the effectiveness of our framework, especially at high sparsity levels, across a range of different neural network architectures and datasets.
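For orientation, the Laplace approximation to the log marginal likelihood mentioned in the abstract can be written in its standard textbook form. The notation is chosen here for illustration and is not necessarily the paper's: $\boldsymbol{\theta}_*$ is the MAP estimate, $d$ the number of parameters, $\mathbf{H}_{\boldsymbol{\theta}_*}$ the Hessian of the negative log likelihood at $\boldsymbol{\theta}_*$, and $\boldsymbol{\Delta}$ the precision matrix of an assumed zero-mean Gaussian prior.

```latex
\log p(\mathcal{D} \mid \mathcal{M})
  \approx \log p(\mathcal{D} \mid \boldsymbol{\theta}_*, \mathcal{M})
        + \log p(\boldsymbol{\theta}_* \mid \mathcal{M})
        + \frac{d}{2} \log 2\pi
        - \frac{1}{2} \log \det \mathbf{P}_{\boldsymbol{\theta}_*},
\qquad
\mathbf{P}_{\boldsymbol{\theta}_*} = \mathbf{H}_{\boldsymbol{\theta}_*} + \boldsymbol{\Delta}
```

The first term rewards data fit, while the prior and log-determinant terms penalize complexity. This trade-off is what gives the objective its Occam's razor behavior.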

Paper Structure (43 sections, 2 theorems, 11 equations, 19 figures, 6 tables, 1 algorithm)


Introduction
Background
Marginal Likelihood for Deep Learning
Neural Network Pruning
Shaving Weights with Occam's Razor
Structured Priors for Regularization
Learning Regularization with the Marginal Likelihood
Optimal Posterior Damage (OPD)
Related work
Laplace-approximated BNNs.
Pruning neural networks.
Experiments
SpaM Improves Performance at High Sparsities
Influence of Priors on Sparsifiability
SpaM Extends to Structured Sparsification
...and 28 more sections

Key Result

Proposition 3.0

Considering the Frobenius norm, the optimal diagonal perturbation of the KFAC eigenvalues $\boldsymbol{\Lambda}_A \otimes \boldsymbol{\Lambda}_B$ to add a diagonal prior precision is given by $\boldsymbol{\Lambda}_A \otimes \boldsymbol{\Lambda}_B + \ldots$

Figures (19)

Figure 1: Overview of our proposed SpaM method. We start by training the network to maximize the marginal likelihood using the Laplace approximation, while simplifying the Hessian computation through either the KFAC or a diagonal approximation. We can then use our precomputed posterior precision as a pruning criterion (OPD). For the case of unstructured pruning, we compute thresholds to achieve different target sparsities, compute the mask, and apply it, while for the structured approach, we aggregate the score per layer for easier weight transfer, compute the mask, and then delete the masked structures to obtain a smaller model.
Figure 2: Predictive performance as a function of sparsity level in unstructured pruning. We see that SpaM improves the performance over MAP training across most architectures, datasets, and pruning criteria, and that OPD often outperforms the other pruning criteria. Both of these effects are particularly visible at higher sparsity levels. The black star in each subfigure denotes the performance of the unpruned models, which is often identical to the performance of models pruned at 20% sparsity.
Figure 3: Uncertainty estimation with pruned ResNets on CIFAR-10. We see that SpaM improves uncertainty estimation in terms of NLL, ECE, and Brier score for many pruning criteria and that our OPD criterion outperforms the other criteria, especially at high sparsities.
Figure 4: Comparison of different priors and Hessian approximations for SpaM-OPD unstructured pruning. The unit-wise and parameter-wise priors show better performance at high sparsity levels, with the parameter-wise one bridging the gap between Diag and KFAC LA.
Figure 5: Similarly to unstructured pruning, we see in this experiment on structured pruning that SpaM (using a unit-wise prior) improves performance over MAP and that OPD mostly outperforms other pruning criteria, especially at higher sparsity levels. The black stars reflect the performance of the unpruned models.
...and 14 more figures
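The unstructured branch of the pipeline described in the Figure 1 caption (score weights with the posterior precision, pick a threshold for a target sparsity, compute the mask, apply it) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the posterior precision is assumed to be diagonal, and the score is assumed to take an OBD-style form, $\tfrac{1}{2} P_{ii} \theta_i^2$. The structured branch is approximated here by summing scores per row, which is likewise an assumption.

```python
import numpy as np


def opd_scores(theta, posterior_precision_diag):
    """Assumed OBD-style saliency: 0.5 * P_ii * theta_i^2 (hypothetical form)."""
    return 0.5 * posterior_precision_diag * theta ** 2


def unstructured_mask(scores, sparsity):
    """Boolean mask that drops the `sparsity` fraction of lowest-scoring weights."""
    n_prune = int(round(sparsity * scores.size))
    mask = np.ones(scores.size, dtype=bool)
    if n_prune > 0:
        # Stable sort so that ties are broken by position, deterministically.
        prune_idx = np.argsort(scores.ravel(), kind="stable")[:n_prune]
        mask[prune_idx] = False
    return mask.reshape(scores.shape)


def unit_scores(scores, axis=1):
    """Aggregate per-weight scores into one score per unit (assumed: a sum)."""
    return scores.sum(axis=axis)


# Toy weight matrix and an illustrative diagonal posterior precision.
theta = np.array([[0.5, -2.0, 0.1],
                  [1.5, 0.0, -0.3]])
precision = np.array([[4.0, 1.0, 10.0],
                      [2.0, 5.0, 1.0]])

scores = opd_scores(theta, precision)
mask = unstructured_mask(scores, sparsity=0.5)
pruned = theta * mask  # masked weights become zero
per_unit = unit_scores(scores)  # one aggregated score per row
```

Selecting the lowest-scoring indices directly, rather than comparing against a numeric threshold, keeps the achieved sparsity exact even when several weights share the same score.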

Theorems & Definitions (3)

Proposition 3.0: Diagonal Prior in KFAC Eigenbasis
Proposition A.0: Diagonal Prior in KFAC Eigenbasis
proof
