Bayesian Ensembling: Insights from Online Optimization and Empirical Bayes

Daniel Waxman, Fernando Llorente, Petar M. Djurić

TL;DR

This work investigates online Bayesian ensembling, contrasting Bayesian model averaging (BMA) with Bayesian stacking (BS) and introducing Online Bayesian Stacking (OBS) as an online, empirical Bayes–inspired alternative. By reframing OBS as online portfolio selection (OPS), the authors leverage regret analysis and efficient online convex optimization (OCO) algorithms (e.g., Exponentiated Gradient, Online Newton Step) to derive performance guarantees and practical guidance. They establish that OBS often outperforms online BMA (O-BMA) and dynamic model averaging (DMA), especially in nonstationary or M-open settings, while providing a hybrid approach when M-closed assumptions may hold. Through extensive experiments across subset linear regression, online variational inference, and time-series forecasting, the paper demonstrates OBS’s robustness and practical advantages, offering actionable recommendations on when and how to deploy OBS in online Bayesian learning. The work bridges Bayesian ensemble learning with portfolio theory, enabling principled, scalable online inference for modern Bayesian models.
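The portfolio-selection correspondence mentioned above can be written out as a short sketch. The shorthand $p_{t,k}$, the learning rate $\eta$, and the simplex $\Delta^{K-1}$ are introduced here for illustration. The O-BMA and Exponentiated Gradient updates are given in their standard textbook forms, which may differ in detail from the variants analyzed in the paper.

```latex
\begin{align*}
p_{t,k} &= p_k(y_t \mid \mathbf{x}_t, \mathcal{D}_{t-1})
  && \text{predictive density of model $k$ (plays the role of a price relative)} \\
\ell_t(\mathbf{w}) &= -\log\Big(\sum_{k} w_k \, p_{t,k}\Big), \quad \mathbf{w} \in \Delta^{K-1}
  && \text{negative log-score of the mixture (negative log-wealth growth)} \\
w_{t+1,k} &\propto w_{t,k} \, p_{t,k}
  && \text{O-BMA: Bayes' rule on the model weights} \\
w_{t+1,k} &\propto w_{t,k} \exp\!\Big(\eta \, \frac{p_{t,k}}{\sum_{j} w_{t,j} \, p_{t,j}}\Big)
  && \text{EG: multiplicative step along the gradient of the log-score}
\end{align*}
```

Under this reading, the ensemble weights are a portfolio, each model's predictive density is an asset's return for the round, and the cumulative log-score is the log-wealth.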

Abstract

We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online, continual learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled methods and guidance on when practitioners should prefer one approach over the other.
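The contrast between OBS and online BMA can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the two candidate models are fixed Gaussians (in the paper, the predictive distributions are updated as data arrive), the data come from a bimodal mixture that neither model matches (an M-open toy case), and OBS is run with a standard Exponentiated Gradient step. The function names, the learning rate `eta=0.05`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def obma_update(w, p):
    """Online BMA: Bayes' rule on model weights, w_k <- w_k * p_k, renormalized."""
    w = w * p
    return w / w.sum()

def eg_update(w, p, eta=0.05):
    """One Exponentiated Gradient step on the mixture log-score log(w . p)."""
    grad = p / np.dot(w, p)        # gradient of log(w . p) with respect to w
    w = w * np.exp(eta * grad)     # multiplicative update keeps weights positive
    return w / w.sum()             # renormalize onto the simplex

def run_ensemble(pred_dens, update):
    """Score each observation with the current weights, then update them."""
    T, K = pred_dens.shape
    w = np.full(K, 1.0 / K)        # start from uniform weights
    log_scores = np.empty(T)
    for t in range(T):
        p = pred_dens[t]           # p[k] = density model k assigns to y_t
        log_scores[t] = np.log(np.dot(w, p))
        w = update(w, p)
    return w, log_scores

# Synthetic M-open data (an assumption for illustration): an equal mixture of
# N(-2, 1) and N(2, 1), while each candidate model is only one of the two.
rng = np.random.default_rng(0)
T = 500
means = np.array([-2.0, 2.0])
component = rng.integers(0, 2, size=T)
y = rng.normal(loc=means[component], scale=1.0)

# pred_dens[t, k] is the N(means[k], 1) density evaluated at y[t].
pred_dens = np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2) / np.sqrt(2 * np.pi)

w_bma, ls_bma = run_ensemble(pred_dens, obma_update)
w_eg, ls_eg = run_ensemble(pred_dens, eg_update)

print("O-BMA final weights:", w_bma, "mean log-score:", ls_bma.mean())
print("EG    final weights:", w_eg, "mean log-score:", ls_eg.mean())
```

In this toy setting the O-BMA weights tend to concentrate almost entirely on a single model, while the EG weights stay near an even split, which is the mixture that actually describes the data. The gap in average log-score is one concrete instance of the BMA pathology the abstract alludes to.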

Paper Structure

This paper contains 61 sections, 5 theorems, 51 equations, 10 figures, 5 tables, 1 algorithm.

Key Result

Proposition 3.1

Let the regret of the BMA mixture with respect to the best individual model be defined as $R_T = \sum_t \log p_{k^{*}}(y_{t} \,|\, \boldsymbol{\mathbf{x}}_t, \mathcal{D}_{t-1}) -\sum_t \log\left(\sum_k w_{t, k} p_k(y_t \,|\, \boldsymbol{\mathbf{x}}_t, \mathcal{D}_{t-1}) \right),$ where $k^{*}$ is the

Figures (10)

  • Figure 1: The average predictive log-likelihood (higher is better) in the toy example. "EG" is Exponentiated Gradient, "ONS" is the Online Newton Step, "BCRP" is the best constant rebalanced portfolio (offline baseline), and "O-BMA" is online BMA. Lines denote the median and shaded areas represent the 10th to 90th percentiles over 10 trials. The first 100 samples are suppressed for readability.
  • Figure 2: The final weights in the "open" and "closed" subset linear regression experiment.
  • Figure 3: The evolution of the weight vector $\boldsymbol{\mathbf{w}}_t$ as a function of $t$ in the "open" and "closed" subset linear regression experiment. Results are shown for a single trial due to the noisy nature of the plots. Dots on the right side of a plot denote the final weights of the BCRP.
  • Figure 4: The average predictive log-likelihood (higher is better) in the MNIST and forecasting experiments, respectively. The method descriptions follow those in the earlier results figure (reference label fig:exp_results), with the addition of Soft-Bayes. Lines denote the median and shaded areas represent the 10th to 90th percentiles over 10 trials. The first 100 samples are suppressed for readability.
  • Figure 5: The final weights in the online variational inference experiment.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Proposition 3.1
  • Corollary 3.2
  • Theorem 3.3
  • proof
  • proof
  • Lemma A.1: kakade2004online, Theorem 2.2
  • Theorem A.2
  • proof