Optimal Batched Linear Bandits

Xuanfei Ren, Tianyuan Jin, Pan Xu

TL;DR

The paper studies batched linear bandits and introduces E$^4$, an Explore-Estimate-Eliminate-Exploit algorithm for learning under batched feedback. With a proper choice of exploration rate, E$^4$ attains the finite-time minimax optimal regret using only $O(\log\log T)$ batches and the asymptotically optimal regret using only 3 batches as $T\rightarrow\infty$; a matching lower bound shows that any asymptotically optimal algorithm needs at least 3 batches in expectation. With a different exploration rate, E$^4$ achieves an instance-dependent regret of $O\bigl(d\log T/\Delta_{\min}\bigr)$ using at most $O(\log T)$ batches while remaining minimax and asymptotically optimal. In experiments on the hard End of Optimism instances and on randomly generated instances, E$^4$ outperforms the baselines in regret, batch complexity, and computational efficiency. According to the authors, it is the first linear bandit algorithm to achieve minimax and asymptotic optimality in regret simultaneously with the corresponding optimal batch complexities.

Abstract

We introduce the E$^4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Eliminate-Exploit framework. With a proper choice of exploration rate, we prove E$^4$ achieves the finite-time minimax optimal regret with only $O(\log\log T)$ batches, and the asymptotically optimal regret with only $3$ batches as $T\rightarrow\infty$, where $T$ is the time horizon. We further prove a lower bound on the batch complexity of linear contextual bandits showing that any asymptotically optimal algorithm must require at least $3$ batches in expectation as $T\rightarrow\infty$, which indicates that E$^4$ achieves asymptotic optimality in regret and batch complexity simultaneously. To the best of our knowledge, E$^4$ is the first algorithm for linear bandits that simultaneously achieves minimax and asymptotic optimality in regret with the corresponding optimal batch complexities. In addition, we show that with another choice of exploration rate, E$^4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches, and maintains minimax optimality and asymptotic optimality. We conduct thorough experiments to evaluate our algorithm on randomly generated instances and the challenging End of Optimism instances (Lattimore & Szepesvári, 2017), which were shown to be hard to learn for optimism-based algorithms. Empirical results show that E$^4$ consistently outperforms baseline algorithms with respect to regret minimization, batch complexity, and computational efficiency.
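
To give a rough sense of scale for the two batch complexities quoted above, the following sketch compares $\log\log T$ with $\log T$ at a few horizons. It is only an illustration: the constants hidden in the $O(\cdot)$ notation are ignored, natural logarithms are assumed, and the helper name `batch_scales` is hypothetical rather than taken from the paper.

```python
import math


def batch_scales(T):
    """Return (log log T, log T) for horizon T.

    Constants hidden in the O(.) bounds are ignored, so these numbers
    only indicate growth rates, not actual batch counts.
    """
    log_T = math.log(T)
    return math.log(log_T), log_T


if __name__ == "__main__":
    for T in (10**3, 10**6, 10**9):
        loglog_T, log_T = batch_scales(T)
        print(f"T = {T:>10}: log log T = {loglog_T:.2f}, log T = {log_T:.2f}")
```

At $T=10^6$ the two quantities are roughly 2.6 and 13.8, which shows how slowly the $O(\log\log T)$ batch budget grows relative to $O(\log T)$.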

Paper Structure

This paper contains 33 sections, 14 theorems, 84 equations, 6 figures, 4 tables, and 1 algorithm.

Key Result

Lemma 3.1

Any consistent algorithm $\pi$ for the linear bandits setting with Gaussian noise has regret $R_T$ satisfying $\liminf_{T\rightarrow\infty}R_T/\log T\geq c^*$, where $c^*$ is the value of the optimization program stated in the paper (referenced there as eq:program).

Figures (6)

  • Figure 1: The End of Optimism instance in $\mathbb{R}^2$. The true parameter $\bm{\theta}^*$ is $(1,0)$. The arms are $\mathbf{x}_1=(1,0),\mathbf{x}_2=(0,1),\mathbf{x}_3=(1-\varepsilon,2\varepsilon)$. Note that $\mathbf{x}_i$ is the best arm if $\bm{\theta}^*$ lies in the colored region $C_i$, $i=1,2,3$.
  • Figure 2: Regret and Batch Analysis: End of Optimism instances ($d=5,K=9$).
  • Figure 3: Regret and Batch Analysis: End of Optimism instances ($d=2,K=3$).
  • Figure 4: Regret and Batch Analysis: End of Optimism instances ($d=3,K=5$).
  • Figure 5: Ablation study on the parameter $\epsilon$.
  • ...and 1 more figure
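
To make the instance in Figure 1 concrete, the sketch below computes the expected reward $\langle \mathbf{x}_i, \bm{\theta}^*\rangle$ of each arm and its suboptimality gap for a given $\varepsilon$. The arms and $\bm{\theta}^*$ come from the caption; the function names and the value $\varepsilon=0.01$ are illustrative assumptions, not settings taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def end_of_optimism_gaps(eps):
    """Expected rewards and gaps for the 2-d instance in Figure 1.

    theta* = (1, 0); arms x1 = (1, 0), x2 = (0, 1), x3 = (1 - eps, 2 * eps).
    """
    theta_star = (1.0, 0.0)
    arms = [(1.0, 0.0), (0.0, 1.0), (1.0 - eps, 2.0 * eps)]
    means = [dot(x, theta_star) for x in arms]
    best = max(means)
    gaps = [best - m for m in means]
    return means, gaps


if __name__ == "__main__":
    means, gaps = end_of_optimism_gaps(eps=0.01)  # eps chosen for illustration
    for i, (m, g) in enumerate(zip(means, gaps), start=1):
        print(f"x{i}: expected reward = {m:.4f}, gap = {g:.4f}")
```

With $\bm{\theta}^*=(1,0)$, arm $\mathbf{x}_1$ is optimal, $\mathbf{x}_2$ has gap 1, and $\mathbf{x}_3$ has gap $\varepsilon$, so the near-optimal arm becomes harder to separate from the best one as $\varepsilon$ shrinks.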

Theorems & Definitions (25)

  • Lemma 3.1 (Lattimore & Szepesvári, 2017)
  • Definition 4.1
  • Remark 4.2
  • Definition 4.3
  • Remark 4.4
  • Definition 5.1
  • Theorem 5.2
  • Remark 5.3
  • Theorem 5.4
  • Remark 5.5
  • ...and 15 more