Bandit Allocational Instability

Yilun Chen; Jiaqi Lu

TL;DR

This work identifies allocational instability as a fundamental side effect of learning in stochastic multi-armed bandits. It defines the allocation variability $S_T$ and proves a sharp trade-off with the regret $R_T$: the lower bound $R_T S_T = \Omega(T^{3/2})$ holds whenever $R_T = o(T)$, so any algorithm with sublinear regret incurs $S_T = \omega(\sqrt{T})$, and any minimax regret-optimal algorithm suffers $S_T = \Theta(T)$. The authors introduce UCB-f, a tunable generalization of UCB1, and prove that every point on the Pareto frontier $R_T S_T = \tilde{\Theta}(T^{3/2})$ is achievable, so the lower bound is tight up to logarithmic factors and reward can be traded smoothly against allocation stability. They explore practical implications for platform operations and post-bandit statistical inference, including lower bounds for joint objective optimization and negative results on sampling stability under minimax-optimal learning. The work contributes an analytic framework linking regret, allocation patterns, and inference stability, and opens directions for contextual extensions and economic analyses of learning-driven systems.

Abstract

When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit huge variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric of MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret $R_T$ and worst-case allocation variability $S_T$ must satisfy $R_T \cdot S_T=\Omega(T^{3/2})$ as $T\rightarrow\infty$, as long as $R_T=o(T)$. This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability $\Theta(T)$, the largest possible scale; while any algorithm with sublinear worst-case regret must necessarily incur $S_T=\omega(\sqrt{T})$. We further show that this lower bound is essentially tight, and that any point on the Pareto frontier $R_T \cdot S_T=\tilde{\Theta}(T^{3/2})$ can be achieved by a simple tunable algorithm UCB-f, a generalization of the classic UCB1. Finally, we discuss implications for platform operations and for statistical inference, when bandit algorithms are used. As a byproduct of our result, we resolve an open question of Praharaj and Khamaru (2025).
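The allocation variability described above can be estimated by simulation. The following is a minimal sketch, not the authors' code: it runs the classic UCB1 index (not UCB-f, whose tuning rule is not given here) on unit-variance Gaussian arms and reports the largest per-arm standard deviation of pull counts across independent runs. The function names, the Gaussian reward model, and the exploration constant are assumptions made for illustration, and the estimate is for a single instance rather than the worst case over the instance class.

```python
import math
import random
import statistics


def ucb1_pull_counts(means, horizon, rng):
    """Run a UCB1-style policy once; return the number of pulls per arm.

    Rewards are Gaussian with unit variance (an illustrative assumption).
    """
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # empirical mean plus the usual UCB1 exploration bonus
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def empirical_allocation_variability(means, horizon, runs, seed=0):
    """Largest (over arms) standard deviation of pull counts across runs."""
    rng = random.Random(seed)
    all_counts = [ucb1_pull_counts(means, horizon, rng) for _ in range(runs)]
    per_arm_std = [
        statistics.pstdev([c[i] for c in all_counts])
        for i in range(len(means))
    ]
    return max(per_arm_std)


if __name__ == "__main__":
    # Two arms with identical means: a setting where the split of pulls
    # between the arms is not pinned down by the rewards.
    for horizon in (500, 2000):
        s = empirical_allocation_variability([0.5, 0.5], horizon, runs=50)
        print(horizon, round(s, 1))
```

Comparing the printed value against $\sqrt{T}$ and $T$ for growing horizons gives a rough empirical feel for where a given policy sits between the two scales; it does not replace the worst-case analysis.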

Paper Structure

This paper contains 40 sections, 22 theorems, 87 equations, 3 figures, 1 algorithm.

Introduction
Contributions
Fundamental regret-allocation-variability trade-off.
Pareto frontier and the UCB-f algorithms.
Implications.
Novel lower bounding technique.
Literature review
Regret-minimization in MAB.
Beyond-expectation analysis of regret.
Post-bandit statistical inference.
Economic aspects of learning algorithms.
Organization of the paper
Preliminary
Additional notation
An Impossibility Theorem
...and 25 more sections

Key Result

Theorem 1

Over the instance class $\mathfrak{P}_{\mathrm{sg}}$, any active-learning algorithm with $R_T = o(T)$ must satisfy $R_T \cdot S_T \ge C\,T^{3/2}$ for all sufficiently large $T$, where $C>0$ is a constant that depends only on $\mathfrak{P}_{\mathrm{sg}}$ and is independent of the algorithm.
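The two consequences quoted in the abstract follow from the product bound $R_T \cdot S_T = \Omega(T^{3/2})$ by direct substitution. A short worked version, taking that bound as given; the constants $c$ and $C$ are placeholders rather than values from the paper, and the upper bound $S_T \le T/2$ is the elementary fact that a count in $[0, T]$ has standard deviation at most $T/2$.

```latex
\begin{align*}
R_T \cdot S_T \ge C\,T^{3/2}
  &\;\Longrightarrow\; S_T \ge \frac{C\,T^{3/2}}{R_T}, \\[4pt]
R_T \le c\sqrt{T} \ \text{(minimax-optimal rate)}
  &\;\Longrightarrow\; S_T \ge \frac{C}{c}\,T,
  \ \text{and since } S_T \le \tfrac{T}{2},\ S_T = \Theta(T), \\[4pt]
R_T = o(T)
  &\;\Longrightarrow\; \frac{S_T}{\sqrt{T}} \ge C\,\frac{T}{R_T} \to \infty,
  \ \text{i.e. } S_T = \omega(\sqrt{T}).
\end{align*}
```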

Figures (3)

Figure 1: Allocation pattern of Thompson Sampling in Example 1.
Figure 2: Pareto frontier of the two objectives of interest.
Figure 3: Pareto frontiers of different objectives of interest.

Theorems & Definitions (39)

Example 1
Definition 1: Allocation Variability
Definition 2: Sub-Gaussian instance class
Definition 3: Active-learning algorithms
Theorem 1
Corollary 1
Corollary 2
Corollary 3
Theorem 2
Theorem 3
...and 29 more
