The Bayesian Learning Rule

Mohammad Emtiyaz Khan, Håvard Rue

TL;DR

The paper presents the Bayesian Learning Rule (BLR), a unifying variational framework that recasts many ML algorithms as instances of optimizing a posterior approximation $q(\boldsymbol{\theta})$ within an exponential family with natural parameter $\boldsymbol{\lambda}$, by minimizing the objective $\mathcal{L}(\boldsymbol{\lambda})=\mathbb{E}_{q}[\bar{\ell}(\boldsymbol{\theta})]-\mathcal{H}(q)$, where $\bar{\ell}$ is the loss and $\mathcal{H}(q)$ is the entropy of $q$. It derives the BLR updates as natural-gradient steps and, equivalently, via mirror descent, showing how classic methods such as gradient descent, Newton's method, and ridge regression emerge from Gaussian candidates, and how mixture candidates extend the rule to multimodal optimization. The BLR is then applied to deep learning, deriving SGD, adaptive optimizers (RMSprop/Adam), dropout, binary-network training (BayesBiNN), and uncertainty-estimation methods (OGN, VOGN), and it places probabilistic inference algorithms (EM, SVI, VMP, non-conjugate VI) within the same framework. The result is a blueprint for designing new algorithms and for understanding the Bayesian content of existing ones, with practical implications for robust learning and scalable inference. The approach emphasizes the role of entropy and information geometry: the choice of posterior family controls the complexity of the resulting algorithm, and the solution comes with uncertainty estimates rather than a point estimate alone.
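
For concreteness, the natural-gradient step on this objective can be written as follows. This is a sketch in notation partly assumed here: the step size $\rho_t$ and the Fisher matrix $\mathbf{F}(\boldsymbol{\lambda})$ of $q$ do not appear in the summary above.

$$
\boldsymbol{\lambda}_{t+1} = \boldsymbol{\lambda}_t - \rho_t\, \widetilde{\nabla}_{\boldsymbol{\lambda}} \Big[ \mathbb{E}_{q_t}[\bar{\ell}(\boldsymbol{\theta})] - \mathcal{H}(q_t) \Big],
\qquad
\widetilde{\nabla}_{\boldsymbol{\lambda}} = \mathbf{F}(\boldsymbol{\lambda})^{-1} \nabla_{\boldsymbol{\lambda}}.
$$

Different choices of the family for $q$, and different approximations of the expectation, then give different algorithms.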

Abstract

We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton's method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms, and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
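
To make the "candidate distribution" idea concrete, here is a minimal sketch for a one-dimensional Gaussian candidate $q = \mathcal{N}(m, 1/s)$, for which the rule takes a Newton-like form: the precision $s$ tracks the expected Hessian and the mean $m$ moves along the expected gradient scaled by $1/s$. The function name `blr_gaussian_step`, the toy data, and the regularizer `delta` are illustrative assumptions, not taken from the paper. Expectations under $q$ are replaced by evaluations at the mean, which is exact for the quadratic ridge loss used here.

```python
def blr_gaussian_step(m, s, grad, hess, rho):
    """One Newton-like BLR step for a 1-D Gaussian candidate q = N(m, 1/s).

    Assumption: E_q[grad] and E_q[hess] are approximated by grad(m) and hess(m).
    For a quadratic loss this approximation is exact.
    """
    s_new = (1.0 - rho) * s + rho * hess(m)   # precision tracks the Hessian
    m_new = m - rho * grad(m) / s_new         # mean takes a preconditioned gradient step
    return m_new, s_new


# Hypothetical toy data for a 1-D ridge regression problem.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 3.9, 6.1]
delta = 0.5  # ridge penalty weight (illustrative)


def grad(theta):
    # Gradient of 0.5 * sum((y - x*theta)^2) + 0.5 * delta * theta^2
    return sum(x * (x * theta - y) for x, y in zip(xs, ys)) + delta * theta


def hess(theta):
    # Hessian of the same loss; constant because the loss is quadratic.
    return sum(x * x for x in xs) + delta


# A single step with rho = 1 from an arbitrary start.
m, s = blr_gaussian_step(0.0, 1.0, grad, hess, rho=1.0)

# Closed-form ridge solution for comparison.
ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + delta)
```

With step size 1, one update lands on the closed-form ridge solution, and `s` equals the curvature of the loss. Holding the precision fixed at 1 instead of updating it turns the mean update into plain gradient descent, which illustrates how restricting the candidate family yields a simpler algorithm.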

Paper Structure

This paper contains 35 sections, 118 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Bayesian solutions have similar robustness properties to flatter minima found in deep learning via stochastic algorithms. Panel (a): When the minimum lies right next to a 'wall', the Bayesian solution shifts away from the wall (towards the flatter side) to avoid large losses under small perturbation. This is due to the averaging over $q_*(\boldsymbol{\theta})$ in condition \ref{eq:conv_cond_1}. Panel (b): Given a sharp minimum vs a flat minimum, the Bayesian solution often prefers the flatter minimum, which is again due to the averaging over $q_*(\boldsymbol{\theta})$.