Scalable Exploration via Ensemble++

Yingru Li, Jiawei Xu, Baoxiang Wang, Zhi-Quan Luo

TL;DR

This work tackles the scalability challenge of Thompson Sampling in large-scale and non-conjugate bandit settings. It introduces Ensemble++, a unified framework that maintains a single shared ensemble factor updated incrementally, enabling approximate posterior sampling with a small ensemble size $M = \Theta(d \log T)$ in linear bandits and extending naturally to nonlinear rewards via a learnable neural feature extractor. The authors provide a theoretical backbone, proving near-optimal regret guarantees for Linear Ensemble++ and detailing a symmetrized ridge-regression view that underpins both linear and neural variants; they also establish a sequential Johnson-Lindenstrauss-based covariance-tracking result to handle adaptivity. Empirically, Ensemble++ achieves superior regret–compute trade-offs across linear, quadratic, neural, and GPT-based contextual bandits, outperforming state-of-the-art baselines such as Ensemble+ and EpiNet, and demonstrates practical applicability to real-world tasks like content moderation with large foundation models. Overall, Ensemble++ bridges theory and practice, delivering scalable Bayesian exploration with strong regret guarantees and broad applicability to non-conjugate, high-dimensional problems.

Abstract

Thompson Sampling is a principled method for balancing exploration and exploitation, but its real-world adoption faces computational challenges in large-scale or non-conjugate settings. While ensemble-based approaches offer partial remedies, they typically require prohibitively large ensemble sizes. We propose Ensemble++, a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling with an ensemble size of only $\Theta(d \log T)$, significantly outperforming prior methods. Crucially, this efficiency holds across both compact and finite action sets with either time-invariant or time-varying contexts without configuration changes. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations while preserving the same incremental update principle, effectively bridging theory and practice for real-world tasks. Comprehensive experiments across linear, quadratic, neural, and GPT-based contextual bandits validate our theoretical findings and demonstrate Ensemble++'s superior regret-computation tradeoff versus state-of-the-art methods.
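To make the shared-factor idea concrete, here is a minimal NumPy sketch of a linear variant. It is an illustration under assumptions, not the authors' implementation: the class name `LinearEnsembleSketch`, the recursive-least-squares form of the update, the unit-sphere perturbation `z`, the Gaussian index `zeta`, the initialisation of `A`, and unit noise variance are all choices made here for illustration. The abstract only states that a shared factor is updated incrementally and combined through random linear combinations.

```python
import numpy as np


def unit_sphere(rng, m):
    """Draw a vector uniformly from the unit sphere in R^m."""
    v = rng.standard_normal(m)
    return v / np.linalg.norm(v)


class LinearEnsembleSketch:
    """Illustrative shared-factor ensemble for a linear bandit (hypothetical names)."""

    def __init__(self, d, M, lam=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.d, self.M = d, M
        self.Sigma = np.eye(d) / lam      # ridge covariance, Sigma_0 = I / lam
        self.mu = np.zeros(d)             # ridge mean estimate, mu_0 = 0
        # Assumed initialisation: d unit-norm rows scaled by the prior std.
        Z0 = np.stack([unit_sphere(self.rng, M) for _ in range(d)])
        self.A = Z0 / np.sqrt(lam)        # shared ensemble factor, shape (d, M)

    def sample_theta(self):
        """Return mu + A @ zeta: a random linear combination of the M columns of A."""
        zeta = self.rng.standard_normal(self.M)   # assumed Gaussian index
        return self.mu + self.A @ zeta

    def act(self, actions):
        """Index of the row of `actions` that is greedy for one sampled parameter."""
        return int(np.argmax(actions @ self.sample_theta()))

    def update(self, x, y):
        """Rank-one incremental update of Sigma, mu and A after observing (x, y)."""
        Sx = self.Sigma @ x
        self.Sigma -= np.outer(Sx, Sx) / (1.0 + x @ Sx)   # Sherman-Morrison step
        gain = self.Sigma @ x                             # uses the updated Sigma
        z = unit_sphere(self.rng, self.M)                 # fresh perturbation vector
        self.mu += gain * (y - x @ self.mu)               # ridge-regression mean
        self.A += np.outer(gain, z - self.A.T @ x)        # same recursion, target z
```

In this sketch, one update costs O(d^2 + dM) operations and one sample costs O(dM), so no matrix is ever re-inverted or re-fit from scratch. The mean follows the usual ridge-regression recursion, while `A` follows the same recursion with the scalar reward replaced by the random vector `z`, which is what lets `A @ A.T` act as a stand-in for the posterior covariance.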

Paper Structure (143 sections, 10 theorems, 131 equations, 19 figures, 5 tables, 7 algorithms)

This paper contains 143 sections, 10 theorems, 131 equations, 19 figures, 5 tables, 7 algorithms.

Key Result

Lemma 4.2

Let $\{{\mathbf{z}}_t\}_{t=1}^T$ be random vectors in $\mathbb{R}^M$ such that they are conditionally $\sqrt{1/M}$-sub-Gaussian and are unit-norm almost surely. Define ${\mathbf{A}}_t$ via the recursive update eq:incre-matrix-factor in Linear Ensemble++, and let ${\bm{\varSigma}}_t$ be the exact rid where $s_{\min}^2 = \inf_{\|a\|=1} \|a\|^2_{{\bm{\varSigma}}_0^{-1}}$ and $s_{\max}^2 = \sup_{\|a\|
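As a rough numerical illustration of what covariance tracking means, the sketch below compares the quadratic forms $a^\top {\mathbf{A}}_t {\mathbf{A}}_t^\top a$ and $a^\top {\bm{\varSigma}}_t a$ along random directions $a$. Several things here are assumptions rather than content of the lemma: the specific recursion (a ridge-style rank-one update with unit noise variance), the initialisation of the factor, the function names `track_covariance` and `variance_ratios`, and the use of a fixed, non-adaptive feature sequence. The lemma itself addresses the harder adaptive case, and its constants are not reproduced here.

```python
import numpy as np


def track_covariance(X, M, lam=1.0, seed=0):
    """Run an assumed ridge-style recursion over the rows of X.

    Returns the factor A (d x M) and the exact ridge covariance Sigma (d x d).
    """
    rng = np.random.default_rng(seed)
    T, d = X.shape
    Sigma = np.eye(d) / lam
    # Assumed initial factor: unit-norm rows scaled by 1 / sqrt(lam).
    Z0 = rng.standard_normal((d, M))
    Z0 /= np.linalg.norm(Z0, axis=1, keepdims=True)
    A = Z0 / np.sqrt(lam)
    for x in X:
        Sx = Sigma @ x
        Sigma = Sigma - np.outer(Sx, Sx) / (1.0 + x @ Sx)  # Sherman-Morrison step
        z = rng.standard_normal(M)
        z /= np.linalg.norm(z)                             # unit-norm perturbation
        A = A + np.outer(Sigma @ x, z - A.T @ x)           # incremental factor update
    return A, Sigma


def variance_ratios(A, Sigma, n_dirs=100, seed=1):
    """Ratios (a^T A A^T a) / (a^T Sigma a) for random directions a."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, Sigma.shape[0]))
    approx = np.einsum("ij,ij->i", dirs @ (A @ A.T), dirs)
    exact = np.einsum("ij,ij->i", dirs @ Sigma, dirs)
    return approx / exact
```

For example, `variance_ratios(*track_covariance(X, M=512))` with `X` a 200 x 5 array of Gaussian features should return values clustered around 1, and the spread should widen as `M` shrinks. This is only a sanity check on a toy sequence, not a substitute for the lemma's guarantee.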

Figures (19)

  • Figure 1: Preview of the regret-computation trade-off in a nonlinear bandit (unknown quadratic reward). The x-axis is the number of model parameters, serving as a proxy for computational cost. Ensemble++ outperforms state-of-the-art baselines Ensemble+ and EpiNet. See details in the experiments section (sec:exp).
  • Figure 2: Performance in Finite-action Linear Bandit. All experiments are repeated over 200 random runs to ensure robust results. (a) Comparison results under $d=50$ and $|\mathcal{X}|=10,000$. ES denotes Linear Ensemble Sampling, ES++ denotes Linear Ensemble++ Sampling, and TS refers to Thompson Sampling. (b) Minimum ensemble size $M$ required for Linear Ensemble++ Sampling to match the regret performance of TS.
  • Figure 3: Comparison results across various nonlinear bandits.
  • Figure 4: Scalability of Ensemble++ with varying decision set sizes ($|\mathcal{X}|$).
  • Figure 5: Sequential dependence due to the interplay between recursive updates and sequential decision-making. $Z_1, \ldots, Z_t$ are no longer independent and identically distributed (i.i.d.) when conditioned on the data ${X_{t+1}}$.
  • ...and 14 more figures

Theorems & Definitions (31)

  • Lemma 4.2: Covariance Tracking via Sequential JL Variant
  • Theorem 4.3: Distribution-dependent Regret for Linear Ensemble++
  • Remark A.1
  • Example 1: Approximate Posterior in a Multi-Armed Bandit
  • Definition C.1: Adapted Stochastic Process
  • Definition C.2: Conditionally $\sigma$-Sub-Gaussian Random Variable / Process
  • Definition C.3: Almost Surely Unit-Norm
  • Remark C.4: Properties of Perturbation Distribution $P_{{\mathbf{z}}}$
  • Lemma C.5: JL lemma [johnson1984extensions]
  • Theorem C.6: Sequential Johnson-Lindenstrauss Theorem, adapted from [li2024probability]
  • ...and 21 more