Linear bandits with polylogarithmic minimax regret

Josep Lumbreras; Marco Tomamichel

TL;DR

This work introduces a linear bandit model with vanishing noise, where the noise variance satisfies $\sigma_t^2 \le 1-\langle\theta,a_t\rangle^2$, and proposes LinUCB-VN, a weighted regularized least-squares approach that preserves optimism while adapting to decreasing noise. A geometric action-selection scheme ensures that the design matrix satisfies $\lambda_{\min}(V_t)=\Omega(\sqrt{\lambda_{\max}(V_t)})$, which yields instantaneous regret of order $1/t$ and a cumulative polylogarithmic regret of $O(d^4\log^3 T)$. The analysis combines a weighted confidence region, a batch-based action update, and an eigenvalue-growth argument that is independent of the noise model. The paper also establishes a minimax lower bound for the constant-noise setting and discusses the limitations of standard lower-bound techniques under vanishing noise. Motivating applications include quantum tomography and other settings where measurement noise decays with alignment to the unknown parameter.
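The vanishing-noise model can be made concrete with a small simulation. The sketch below assumes Gaussian rewards with mean $\langle\theta,a\rangle$ and variance exactly $1-\langle\theta,a\rangle^2$ (the model used in the numerical test of Figure 2); the function names are hypothetical and chosen only for illustration.

```python
import math
import random


def random_unit_vector(d, rng):
    """Sample a point uniformly on the unit sphere in R^d (normalized Gaussians)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def noise_variance(theta, action):
    """Variance 1 - <theta, a>^2, clamped at 0 to absorb rounding error."""
    inner = sum(t * a for t, a in zip(theta, action))
    return max(0.0, 1.0 - inner * inner)


def sample_reward(theta, action, rng):
    """Draw r ~ N(<theta, a>, 1 - <theta, a>^2); Gaussian noise is an assumption."""
    inner = sum(t * a for t, a in zip(theta, action))
    return rng.gauss(inner, math.sqrt(noise_variance(theta, action)))


if __name__ == "__main__":
    rng = random.Random(0)
    theta = random_unit_vector(3, rng)
    far_action = random_unit_vector(3, rng)
    # The variance shrinks to zero as the action aligns with theta.
    print("variance at a random action:", noise_variance(theta, far_action))
    print("variance at a = theta:      ", noise_variance(theta, theta))
    print("one reward at a = theta:    ", sample_reward(theta, theta, rng))
```

Playing the optimal action therefore returns an almost noiseless reward, which is the structure the weighted estimator exploits.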

Abstract

We study a noise model for linear stochastic bandits for which the subgaussian noise parameter vanishes linearly as we select actions on the unit sphere closer and closer to the unknown vector. We introduce an algorithm for this problem that exhibits a minimax regret scaling as $\log^3(T)$ in the time horizon $T$, in stark contrast to the square-root scaling of this regret for typical bandit algorithms. Our strategy, based on weighted least-squares estimation, achieves the eigenvalue relation $\lambda_{\min}(V_t) = \Omega(\sqrt{\lambda_{\max}(V_t)})$ for the design matrix $V_t$ at each time step $t$ through geometrical arguments that are independent of the noise model and might be of independent interest. This allows us to tightly control the expected regret in each time step to be of the order $O(\frac{1}{t})$, leading to the logarithmic scaling of the cumulative regret.
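The step from a per-round bound of order $1/t$ to logarithmic cumulative regret is the harmonic-sum estimate. A short worked version follows; the constant $c$ and the extra $\ln^2 t$ factor in the second line are illustrative assumptions about how polylogarithmic factors could enter, not the paper's stated per-round bound.

```latex
\sum_{t=1}^{T} \frac{c}{t} \;\le\; c\,(1 + \ln T),
\qquad
\sum_{t=1}^{T} \frac{c\,\ln^2 t}{t}
  \;\le\; c\,\ln^2 T \sum_{t=1}^{T} \frac{1}{t}
  \;\le\; c\,\ln^2 T\,(1 + \ln T)
  \;=\; O(\log^3 T).
```

By comparison, a per-round bound of order $1/\sqrt{t}$ sums to order $\sqrt{T}$, which is the square-root scaling of typical bandit algorithms.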

Paper Structure (16 sections, 12 theorems, 214 equations, 3 figures, 1 algorithm)

Introduction
Notation and model
Weighted regularized least squares estimator and confidence region
Algorithm for linear bandits with vanishing noise: LinUCB-VN
Actions and eigenvalue analysis of design matrix
Regret analysis
Open problems
Proofs of Section 3
Proof of Lemma \ref{lem:confidence_region_weighted}
Proofs of Section 5
Proof of Theorem \ref{th:main}
Alternative proof for special case $d=2$
Proofs of Section 6
Proof of Theorem \ref{th:regret_bound_d2}
Minimax lower bound for linear bandit $\mathcal{A} = \mathbb{S}^d, \theta\in\mathbb{S}^d$ and constant noise
...and 1 more sections

Key Result

Theorem 1

For any $T \in \mathbb{N}$ there exists an instance of LinUCB-VN such that, for any $\theta \in \mathcal{E}$, we have

Figures (3)

Figure 1: Scheme for the choice of actions $a_t^+,a_t^-$ of LinUCB-VN. The actions are selected as the projections of the extremal points along the longest axis of the confidence region centered around a weighted least squares estimator $\widetilde{\theta}^{\text{w}}_t$ of the unknown parameter $\theta$. This choice is sufficient to increase the minimum eigenvalue of $V_t$ such that the relation $\lambda_{\min}(V_t) = \Omega (\sqrt{\lambda_{\max}(V_t)})$ is satisfied. Moreover, the actions $a_t^+$ and $a_t^-$ are sufficiently close to $\theta$ to keep the regret small.
Figure 2: We numerically test LinUCB and LinUCB-VN in a linear bandit with action set $\mathcal{A} = \mathbb{S}^2$ and reward model $r_t \sim \mathcal{N}(\langle \theta , a_t \rangle , 1 - \langle \theta , a_t \rangle^2 )$. Each point in the plot comes from an independent run and is averaged over 100 instances with random environments $\theta\in\mathbb{S}^2$. Left plot: Scaling of the regret for LinUCB and LinUCB-VN. We fit the functions $R(t) = 1.86\log^2 t$ for LinUCB-VN and $R(t) = 0.88\sqrt{t\log t}$ for LinUCB. Right plots: Scaling of the maximum and minimum eigenvalues of the matrix $V_t$ for LinUCB-VN. The scaling shows the relation $\lambda_{\min} = \Omega ( \sqrt{\lambda_{\max}} )$. We fit the function $\lambda_{\min}(V_t) = 0.2059t$ for the minimum eigenvalue and $\lambda_{\max}(V_t) = 0.0012t^2$ for the maximum eigenvalue. The behavior $\lambda_{\min}(V_t) = \Theta ( t )$ is the one that gives us the theoretical guarantee of polylogarithmic scaling of the regret.
Figure 3: Sketch for the triangle inequality used to bound $\| \theta - a^\pm_{t,i} \|_2$ in \ref{eq:theta_at_bound}. The red lines represent the distances $(i)$, $(ii)$ and $(iii)$. Under the event $\theta\in\mathcal{C}_{t-1}$ we can use $\mathcal{C}_{t-1} \subseteq \mathbb{B}^d_r (\widetilde{\theta}^{\text{w}}_{t-1} )$ with $r$ being the longest axis of the ellipsoid and bound all distances by the diameter $2r$.
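The action choice described in the Figure 1 caption can be sketched for $d=2$. This is a minimal illustration under stated assumptions, not the paper's algorithm: the confidence region is taken to be the ellipsoid $\{x : (x-\hat\theta)^\top V (x-\hat\theta) \le \beta\}$, "projection" is taken to mean radial normalization onto the unit circle, and the names `smallest_eigpair_2x2`, `candidate_actions`, and `beta` are hypothetical. The weights, batching, and confidence radius of LinUCB-VN are not reproduced here.

```python
import math


def smallest_eigpair_2x2(v):
    """Return (lambda_min, unit eigenvector) of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = v[0][0], v[0][1], v[1][1]
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    lam_min = mean - radius
    if abs(b) > 1e-15:
        # Solves (a - lam_min) * x + b * y = 0.
        x, y = b, lam_min - a
    elif a <= c:
        x, y = 1.0, 0.0
    else:
        x, y = 0.0, 1.0
    norm = math.hypot(x, y)
    return lam_min, (x / norm, y / norm)


def candidate_actions(theta_hat, v, beta):
    """Normalize the two end points of the ellipsoid's longest axis onto the unit circle.

    The longest semi-axis points along the eigenvector of lambda_min(V)
    and has length sqrt(beta / lambda_min). Assumes V is positive definite
    and the end points are not at the origin.
    """
    lam_min, (ux, uy) = smallest_eigpair_2x2(v)
    r = math.sqrt(beta / lam_min)
    actions = []
    for sign in (1.0, -1.0):
        px = theta_hat[0] + sign * r * ux
        py = theta_hat[1] + sign * r * uy
        norm = math.hypot(px, py)
        actions.append((px / norm, py / norm))
    return actions


if __name__ == "__main__":
    # Estimate near (1, 0); V is well explored along x and poorly along y.
    design = [[400.0, 0.0], [0.0, 25.0]]
    for action in candidate_actions((1.0, 0.0), design, beta=1.0):
        print(action)
```

Both actions tilt away from the estimate along the least-explored direction, so playing them adds information where $V_t$ is weakest while staying close to the estimate.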

Theorems & Definitions (12)

Theorem 1: main result, informal version
Lemma 2
Theorem 3
Theorem 4
Theorem 6: Theorem 3.9 in lattimore_szepesvári_2020
Corollary 7: Corollary III.1.2 in bhatia97
Theorem 8
Lemma 9
Theorem 10: MichelPetrovitch1901
Corollary 11
...and 2 more
