Optimal Batched Linear Bandits

Xuanfei Ren; Tianyuan Jin; Pan Xu

TL;DR

The paper studies batched linear bandits and introduces the E$^4$ algorithm, built on an Explore-Estimate-Eliminate-Exploit framework for learning under batched feedback. With a proper choice of exploration rate, E$^4$ attains the finite-time minimax optimal regret with only $O(\log\log T)$ batches and the asymptotically optimal regret with only 3 batches as $T\rightarrow\infty$; a matching lower bound shows that any asymptotically optimal algorithm needs at least 3 batches in expectation. With a different exploration rate, E$^4$ achieves an instance-dependent regret of $O\bigl(d\log T/\Delta_{\min}\bigr)$ with at most $O(\log T)$ batches while preserving minimax and asymptotic optimality. Empirically, E$^4$ outperforms baseline algorithms in regret, batch complexity, and computational efficiency on randomly generated instances and on the hard End of Optimism instances. According to the authors, E$^4$ is the first linear bandit algorithm to achieve both minimax and asymptotic optimality in regret with the corresponding optimal batch complexities.

Abstract

We introduce the E$^4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Eliminate-Exploit framework. With a proper choice of exploration rate, we prove E$^4$ achieves the finite-time minimax optimal regret with only $O(\log\log T)$ batches, and the asymptotically optimal regret with only $3$ batches as $T\rightarrow\infty$, where $T$ is the time horizon. We further prove a lower bound on the batch complexity of linear contextual bandits showing that any asymptotically optimal algorithm must require at least $3$ batches in expectation as $T\rightarrow\infty$, which indicates E$^4$ achieves the asymptotic optimality in regret and batch complexity simultaneously. To the best of our knowledge, E$^4$ is the first algorithm for linear bandits that simultaneously achieves the minimax and asymptotic optimality in regret with the corresponding optimal batch complexities. In addition, we show that with another choice of exploration rate E$^4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches, and maintains the minimax optimality and asymptotic optimality. We conduct thorough experiments to evaluate our algorithm on randomly generated instances and the challenging End of Optimism instances (Lattimore and Szepesvári, 2017), which were shown to be hard to learn for optimism-based algorithms. Empirical results show that E$^4$ consistently outperforms baseline algorithms with respect to regret minimization, batch complexity, and computational efficiency.
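To give a feel for how a horizon of length $T$ can be covered with only $O(\log\log T)$ batches, the sketch below builds a grid whose batch end points grow as $T^{1-2^{-\ell}}$. This is a generic grid from the batched bandit literature, used here purely as an illustration; it is an assumption and not necessarily the batch schedule of E$^4$. The function name `doubling_exponent_grid` is hypothetical.

```python
import math


def doubling_exponent_grid(horizon: int) -> list[int]:
    """Batch end points t_l = floor(horizon ** (1 - 2 ** -l)), closed by horizon.

    Illustrative assumption only: a generic grid, not the E^4 schedule.
    """
    if horizon < 2:
        return [horizon]
    # After about log2(log2(T)) steps, T ** (2 ** -l) <= 2, so t_l >= T / 2
    # and a single final batch can cover the remaining rounds.
    num_batches = max(1, math.ceil(math.log2(math.log2(horizon)))) + 1
    grid: list[int] = []
    for level in range(1, num_batches):
        end_point = math.floor(horizon ** (1.0 - 2.0 ** (-level)))
        # Skip end points that do not advance the grid (possible for tiny T).
        if end_point >= 1 and (not grid or end_point > grid[-1]):
            grid.append(end_point)
    if not grid or grid[-1] < horizon:
        grid.append(horizon)
    return grid


if __name__ == "__main__":
    for horizon in (10**3, 10**6, 10**9):
        grid = doubling_exponent_grid(horizon)
        print(horizon, len(grid), grid)
```

The number of batches grows very slowly with the horizon, which is the practical appeal of a $\log\log T$ batch complexity.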

Paper Structure (33 sections, 14 theorems, 84 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Notation
Related Work
Preliminary
Asymptotic lower bound
Least square estimators
Optimal Batched Linear Bandits Algorithm
Batch $\ell=1$:
Batch $\ell=2$:
Batch $\ell\geq3$:
Final batch:
Theoretical Analysis
Experiments
Conclusion and Future Work
Additional Experiments
...and 18 more sections

Key Result

Lemma 3.1

Any consistent algorithm $\pi$ for the linear bandits setting with Gaussian noise has regret $R_T$ satisfying $\liminf_{T\rightarrow\infty}R_T/\log T\geq c^*$, where $c^*$ is the optimal value of the optimization program labeled eq:program in the paper.
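The program itself is not reproduced on this page. For orientation, the asymptotic lower bound of Lattimore and Szepesvári (2017) is commonly stated, for unit-variance Gaussian noise, in the form sketched below. The symbols $\mathcal{X}$ (arm set), $\mathcal{X}^-$ (suboptimal arms), $\Delta_{\mathbf{x}}$ (gap of arm $\mathbf{x}$), $\alpha$ (allocation), and $H_\alpha$ are notation assumed here and may differ from the paper's eq:program, which should be checked for the exact statement.

$$
c^* \;=\; \inf_{\alpha\in[0,\infty)^{\mathcal{X}}}\ \sum_{\mathbf{x}\in\mathcal{X}^-}\alpha(\mathbf{x})\,\Delta_{\mathbf{x}}
\quad\text{subject to}\quad
\|\mathbf{x}\|^2_{H_\alpha^{-1}} \le \frac{\Delta_{\mathbf{x}}^2}{2}\ \ \text{for all } \mathbf{x}\in\mathcal{X}^-,
\qquad
H_\alpha \;=\; \sum_{\mathbf{x}\in\mathcal{X}}\alpha(\mathbf{x})\,\mathbf{x}\mathbf{x}^\top .
$$

Read this way, $\alpha(\mathbf{x})\log T$ is roughly the number of pulls of arm $\mathbf{x}$ needed so that every suboptimal arm can be statistically distinguished from the best one, and $c^*$ is the cheapest such allocation measured in regret.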

Figures (6)

Figure 1: The End of Optimism instance in $\mathbb{R}^2$. The true parameter $\bm{\theta}^*$ is $(1,0)$. The arms are $\mathbf{x}_1=(1,0),\mathbf{x}_2=(0,1),\mathbf{x}_3=(1-\varepsilon,2\varepsilon)$. Note that $\mathbf{x}_i$ is the best arm if $\bm{\theta}^*$ lies in the colored region $C_i$, $i=1,2,3$.
Figure 2: Regret and Batch Analysis: End of Optimism instances ($d=5,K=9$).
Figure 3: Regret and Batch Analysis: End of Optimism instances ($d=2,K=3$).
Figure 4: Regret and Batch Analysis: End of Optimism instances ($d=3,K=5$).
Figure 5: Ablation study on the parameter $\epsilon$.
...and 1 more figure
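The End of Optimism instance from Figure 1 is small enough to check by hand. The sketch below computes the expected rewards and suboptimality gaps of the three arms for $\bm{\theta}^*=(1,0)$; the function name `end_of_optimism_gaps` and the value $\varepsilon=0.01$ are illustrative assumptions, not taken from the paper's experiments.

```python
def end_of_optimism_gaps(eps: float) -> tuple[list[float], list[float]]:
    """Expected rewards and gaps for the 2-d End of Optimism instance.

    Arms and parameter follow the Figure 1 caption; eps is a free parameter.
    """
    theta = (1.0, 0.0)
    arms = [
        (1.0, 0.0),              # x_1: optimal arm
        (0.0, 1.0),              # x_2: far from optimal
        (1.0 - eps, 2.0 * eps),  # x_3: nearly optimal
    ]
    # Expected reward of an arm is its inner product with theta.
    rewards = [sum(a * t for a, t in zip(arm, theta)) for arm in arms]
    best = max(rewards)
    gaps = [best - reward for reward in rewards]
    return rewards, gaps


if __name__ == "__main__":
    rewards, gaps = end_of_optimism_gaps(0.01)
    for index, (reward, gap) in enumerate(zip(rewards, gaps), start=1):
        print(f"x_{index}: reward={reward:.4f}, gap={gap:.4f}")
```

The gaps are $0$, $1$, and $\varepsilon$: arm $\mathbf{x}_3$ is hard to tell apart from $\mathbf{x}_1$, while $\mathbf{x}_2$ is clearly suboptimal. The difference $\mathbf{x}_1-\mathbf{x}_3=(\varepsilon,-2\varepsilon)$ has its larger component along the second coordinate, which is the direction that pulling $\mathbf{x}_2$ probes directly.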

Theorems & Definitions (25)

Lemma 3.1 (Lattimore and Szepesvári, 2017)
Definition 4.1
Remark 4.2
Definition 4.3
Remark 4.4
Definition 5.1
Theorem 5.2
Remark 5.3
Theorem 5.4
Remark 5.5
...and 15 more
