Bayesian Lottery Ticket Hypothesis
Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz, Charlotte Debus
TL;DR
The paper addresses the computational burden of Bayesian neural networks (BNNs) by testing whether the Lottery Ticket Hypothesis (LTH) holds in Bayesian settings. It adapts Iterative Magnitude Pruning (IMP) to mean-field variational Bayes for CNNs and a Vision Transformer on CIFAR-10, compares Bayesian tickets with deterministic baselines, and examines the transplantation of non-Bayesian tickets into BNNs. Bayesian lottery tickets exist across architectures, with deeper layers pruned more heavily; at high sparsity, pruning strategies that emphasize mean magnitude and uncertainty outperform the others. A transplantation approach can reduce training time by 3–7x while maintaining calibration. These results suggest that sparse Bayesian training is feasible and calibration-friendly, offering practical routes to efficient uncertainty-aware models, especially when computational resources are limited.
Abstract
Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can be trained to match or even surpass the accuracy of the original dense network. Such sparse networks can lower the demand for computational resources both at inference and during training. If the LTH holds for BNNs and corresponding sparse subnetworks exist, this could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. To this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets, and extend our study to a transplantation method connecting BNNs with deterministic lottery tickets. We generally find that the LTH holds in BNNs, and that winning tickets of matching and surpassing accuracy are present independent of model size, with degradation at very high sparsities. However, the pruning strategy should rely primarily on magnitude and only secondarily on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.
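
To make the setup concrete, the following is a minimal NumPy sketch of how iterative pruning might look when each weight carries a mean-field Gaussian posterior with mean `mu` and standard deviation `sigma`. It is an illustration under stated assumptions, not the authors' implementation: the function names, the two example criteria (`"magnitude"`, which scores by |mu|, and `"snr"`, which scores by |mu| / sigma), the per-round pruning fraction, and the `train_fn` placeholder are all hypothetical and do not come from the paper.

```python
import numpy as np


def prune_scores(mu, sigma, criterion="magnitude"):
    """Per-weight importance score; the lowest scores are pruned first.

    Both criteria are illustrative assumptions, not the paper's definitions.
    """
    if criterion == "magnitude":
        return np.abs(mu)            # mean magnitude only
    if criterion == "snr":
        return np.abs(mu) / sigma    # magnitude scaled by posterior uncertainty
    raise ValueError(f"unknown criterion: {criterion}")


def prune_step(mu, sigma, mask, fraction=0.2, criterion="magnitude"):
    """Remove `fraction` of the surviving weights with the lowest score."""
    scores = prune_scores(mu, sigma, criterion).ravel()
    alive = np.flatnonzero(mask)               # flat indices of surviving weights
    n_prune = int(fraction * alive.size)
    new_mask = mask.copy().ravel()
    if n_prune > 0:
        order = np.argsort(scores[alive])      # ascending: least important first
        new_mask[alive[order[:n_prune]]] = False
    return new_mask.reshape(mask.shape)


def iterative_pruning(mu_init, sigma_init, train_fn, rounds=3,
                      fraction=0.2, criterion="magnitude"):
    """Train, prune, rewind to the initial parameters, and repeat."""
    mask = np.ones_like(mu_init, dtype=bool)
    for _ in range(rounds):
        # Rewind: every round restarts from the masked initial parameters.
        mu, sigma = train_fn(mu_init * mask, sigma_init, mask)
        mask = prune_step(mu, sigma, mask, fraction, criterion)
    return mask


# Toy usage with a stand-in for variational training (hypothetical).
rng = np.random.default_rng(0)
mu0 = rng.normal(size=(8, 8))
sigma0 = np.full_like(mu0, 0.1)


def dummy_train(mu, sigma, mask):
    # Placeholder: perturbs the means only; a real run would optimize the ELBO.
    return (mu + 0.01 * rng.normal(size=mu.shape)) * mask, sigma


ticket_mask = iterative_pruning(mu0, sigma0, dummy_train, rounds=3, fraction=0.2)
print("surviving fraction:", ticket_mask.mean())
```

In this sketch the candidate ticket is the pair of `ticket_mask` and the initial parameters `mu0`, `sigma0`. Swapping the `criterion` argument is one way to compare a magnitude-first strategy against one that also accounts for the standard deviation.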
