Bayesian sparsification for deep neural networks with Bayesian model reduction

Dimitrije Marković, Karl J. Friston, Stefan J. Kiebel

TL;DR

This paper addresses the computational bottleneck of Bayesian sparsification in deep neural networks by introducing Bayesian model reduction (BMR) as a scalable, post-hoc pruning criterion. It integrates BMR with variational inference (including stochastic and black-box variants) to prune weights using posterior evidence comparisons, contrasting spike-and-slab and horseshoe priors and demonstrating efficiency through a fully factorised mean-field approach. Empirical results across architectures from LeNet to Vision Transformers and MLP-Mixers show competitive accuracy alongside strong sparsity and improved calibration metrics (lower NLL and ECE) relative to baselines, with BMR offering faster convergence. The work provides practical pruning criteria via the change in variational free energy ΔF and suggests broad applicability to probabilistic ML tasks, offering a pathway to scalable, uncertainty-aware model compression.

Abstract

Deep learning's immense capabilities are often constrained by the complexity of its models, leading to an increasing demand for effective sparsification techniques. Bayesian sparsification for deep learning emerges as a crucial approach, facilitating the design of models that are both computationally efficient and competitive in terms of performance across various deep learning applications. The state-of-the-art -- in Bayesian sparsification of deep neural networks -- combines structural shrinkage priors on model weights with an approximate inference scheme based on stochastic variational inference. However, model inversion of the full generative model is exceptionally computationally demanding, especially when compared to standard deep learning of point estimates. In this context, we advocate for the use of Bayesian model reduction (BMR) as a more efficient alternative for pruning of model weights. As a generalization of the Savage-Dickey ratio, BMR allows a post-hoc elimination of redundant model weights based on the posterior estimates under a straightforward (non-hierarchical) generative model. Our comparative study highlights the advantages of the BMR method relative to established approaches based on hierarchical horseshoe priors over model weights. We illustrate the potential of BMR across various deep learning architectures, from classical networks like LeNet to modern frameworks such as Vision Transformers and MLP-Mixers.
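
The Savage-Dickey special case mentioned in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each weight has a zero-mean Gaussian prior N(0, prior_sigma^2), a fully factorised Gaussian approximate posterior N(mu, sigma^2), and a reduced model that fixes the weight at exactly zero. Under those assumptions the log evidence ratio of the reduced model to the full model is the log posterior density at zero minus the log prior density at zero, which simplifies to ln(prior_sigma / sigma) - mu^2 / (2 sigma^2). The function names and the sign convention (positive values favour pruning) are hypothetical; the paper's own ΔF criterion may use a different reduced prior and sign.

```python
import math


def savage_dickey_log_ratio(mu, sigma, prior_sigma=1.0):
    """Log evidence ratio of the reduced model (weight = 0) to the full model.

    Assumption: prior N(0, prior_sigma^2), approximate posterior N(mu, sigma^2).
    The ratio equals log q(0) - log p(0), i.e. the Savage-Dickey density ratio.
    """
    log_posterior_at_zero = (
        -0.5 * math.log(2.0 * math.pi * sigma**2) - mu**2 / (2.0 * sigma**2)
    )
    log_prior_at_zero = -0.5 * math.log(2.0 * math.pi * prior_sigma**2)
    return log_posterior_at_zero - log_prior_at_zero


def keep_mask(mus, sigmas, prior_sigma=1.0):
    """Return True for weights to keep, False for weights to prune.

    A weight is pruned when the reduced model has higher evidence,
    i.e. when the log ratio is strictly positive (illustrative convention).
    """
    return [
        savage_dickey_log_ratio(m, s, prior_sigma) <= 0.0
        for m, s in zip(mus, sigmas)
    ]


# Toy posterior estimates for three weights (made-up numbers).
posterior_means = [0.01, 1.5, -0.2]
posterior_stds = [0.1, 0.1, 0.5]
mask = keep_mask(posterior_means, posterior_stds, prior_sigma=1.0)
```

In this toy setting only the second weight is kept: its posterior mean lies many standard deviations away from zero, so the posterior density at zero is far below the prior density. The other two weights have posteriors that place more mass near zero than the prior does, so the reduced model is favoured. Note that pruning needs only the posterior means and standard deviations, which is what makes a criterion of this kind applicable post hoc.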

Paper Structure

This paper contains 13 sections, 30 equations, 6 figures.

Figures (6)

  • Figure 1: Classification performance comparison on the FashionMNIST dataset for different neuronal architectures and approximate inference schemes.
  • Figure 2: Total fraction of pruned model parameters obtained with the stochastic BMR algorithm across different DNN architectures and datasets.
  • Figure 3: Cumulative Distribution Function (CDF) of absolute posterior parameter expectations at different layers of MLP (top row), and LeNet architectures (bottom row). The y-axis represents the fraction of parameters with values less than or equal to the value on the x-axis.
  • Figure 4: Posterior expectations (color coded) over model parameters obtained using different approximate inference schemes at the first layer of (a) MLP architecture, and (b) LeNet architectures.
  • Figure S1: Classification performance comparison on the CIFAR10 dataset for different neuronal architectures and approximate inference schemes.
  • ...and 1 more figure