The Bayesian Learning Rule
Mohammad Emtiyaz Khan, Håvard Rue
TL;DR
The paper presents the Bayesian Learning Rule (BLR), a unifying variational framework that recasts many ML algorithms as instances of optimizing a posterior approximation $q(\boldsymbol{\theta})$ within an exponential-family, driven by the objective $\mathcal{L}(\boldsymbol{\lambda})=\mathbb{E}_{q}[\bar{\ell}(\boldsymbol{\theta})]-\mathcal{H}(q)$. It derives BLR updates as natural-gradient steps and via mirror-descent, showing how classic methods like gradient descent, Newton's method, and ridge regression emerge from Gaussian candidates, while extending to mixtures for multimodal optimization. The BLR is then applied to deep learning, deriving SGD, adaptive optimizers (RMSprop/Adam), dropout variants (including BayesBiNN), and uncertainty estimation approaches (OGN, VOGN), and it connects probabilistic inference (EM, SVI, VMP, non-conjugate VI) within a single principled framework. This provides a coherent blueprint for designing new algorithms and for understanding the Bayesian content of existing ones, with practical implications for robust learning and scalable inference. The approach emphasizes the role of entropy and information geometry, enabling principled automatic complexity control through the choice of posterior family and yielding uncertainty estimates alongside optimized solutions.
Abstract
We show that many machine-learning algorithms are specific instances of a single algorithm called the \emph{Bayesian learning rule}. The rule, derived from Bayesian principles, yields a wide-range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
