How Does Variance Shape the Regret in Contextual Bandits?

Zeyu Jia, Jian Qian, Alexander Rakhlin, Chen-Yu Wei

TL;DR

The eluder dimension $d_\text{elu}$, a complexity measure of the function class, is proved to play a crucial role in variance-dependent regret bounds, and the regret bound $\tilde{O}(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ is shown to be unimprovable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$.

Abstract

We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension $d_\text{elu}$, a complexity measure of the function class, plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of $\Omega(\sqrt{\min\{A,d_\text{elu}\}\Lambda}+d_\text{elu})$ is unavoidable when $d_\text{elu}\leq\sqrt{AT}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A\leq d_\text{elu}$ regime, we derive a nearly matching upper bound $\tilde{O}(\sqrt{A\Lambda}+d_\text{elu})$ for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of $\Omega(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ is unavoidable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$. In this setting, we provide an upper bound of order $\tilde{O}(d_\text{elu}\sqrt{\Lambda}+d_\text{elu})$. Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound $\tilde{O}(\sqrt{d_\text{elu}\Lambda}+d_\text{elu})$ established in their work is unimprovable when $\sqrt{d_\text{elu}\Lambda}+d_\text{elu}\leq\sqrt{AT}$. However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of $\tilde{O}(\sqrt{A\Lambda}+d_\text{elu})$.
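To get a feel for how the rates in the abstract compare, the following minimal sketch evaluates them numerically with all constants and logarithmic factors dropped. The function names, the sample values of $A$, $d_\text{elu}$, $T$, and $\Lambda$, and the use of $\sqrt{AT}$ as the benchmark rate are illustrative assumptions, not part of the paper.

```python
import math

# All functions return the *order* of a bound only: constants and log factors
# are dropped, so the numbers are for rough comparison, not exact regret values.

def weak_adversary_rate(A, d_elu, Lam):
    """Weak-adversary lower bound order: sqrt(min(A, d_elu) * Lam) + d_elu."""
    return math.sqrt(min(A, d_elu) * Lam) + d_elu

def strong_adversary_lower_rate(d_elu, Lam):
    """Strong-adversary lower bound order: sqrt(d_elu * Lam) + d_elu."""
    return math.sqrt(d_elu * Lam) + d_elu

def strong_adversary_upper_rate(d_elu, Lam):
    """Strong-adversary upper bound order: d_elu * sqrt(Lam) + d_elu."""
    return d_elu * math.sqrt(Lam) + d_elu

def benchmark_rate(A, T):
    """sqrt(A * T), the quantity the abstract's conditions compare against."""
    return math.sqrt(A * T)

if __name__ == "__main__":
    A, d_elu, T = 10, 20, 100_000  # hypothetical problem sizes
    for Lam in (0.0, 100.0, 10_000.0):  # hypothetical total variances
        print(
            f"Lambda={Lam:>8.0f}  "
            f"weak={weak_adversary_rate(A, d_elu, Lam):8.1f}  "
            f"strong_lower={strong_adversary_lower_rate(d_elu, Lam):8.1f}  "
            f"strong_upper={strong_adversary_upper_rate(d_elu, Lam):8.1f}  "
            f"sqrt(AT)={benchmark_rate(A, T):8.1f}"
        )
```

When $\Lambda$ is much smaller than $T$, the variance-dependent rates fall well below $\sqrt{AT}$; as $\Lambda$ grows, the gap between the strong-adversary lower and upper rates (a factor of up to $\sqrt{d_\text{elu}}$ in the leading term) becomes visible.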

Paper Structure

This paper contains 49 sections, 41 theorems, 211 equations, 1 table, and 6 algorithms.

Key Result

Theorem 1

For any integers $d,A\geq 2$, any positive real number $\sigma\in [0,1]$, and time $T>0$, there exists a context space $\mathcal{X}$ and a contextual bandit problem $\mathcal{F}\subset (\mathcal{X}\times\mathcal{A}\to \mathbb{R})$ with eluder dimension $d_{\text{elu}}(0) \leq d$, action set $\mathcal{A}$ ...

Theorems & Definitions (43)

  • Definition 2.1: Eluder Dimension (Russo and Van Roy, 2014)
  • Theorem 1: Main lower bound
  • Theorem 2
  • Corollary 4.1
  • Theorem 3
  • Theorem 4
  • Definition 6.1: Hellinger Eluder Dimension
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 33 more