Variance-Dependent Regret Lower Bounds for Contextual Bandits

Jiafan He; Quanquan Gu

TL;DR

This work establishes variance-dependent regret lower bounds for linear contextual bandits in two key settings: a prefixed variance sequence and an adaptive variance sequence with a weak adversary. It introduces intricate constructions and proof techniques (peeling, multi-instance, and high-probability boosting) to show that, under prefixed sequences, the expected regret is at least $\Omega(d\sqrt{\sum_{k=1}^K \sigma_k^2}/\log K)$, aligning with variance-aware upper bounds up to logarithms. For adaptive sequences, a high-probability lower bound of $\Omega(d\sqrt{\sum_{k=1}^K \sigma_k^2}/\log^6(dK))$ holds under a weak adversary, while a strong-adversary model demonstrates that no general variance-dependent lower bound can exist. The results collectively bridge the gap between lower and upper bounds in the variance-aware linear bandit setting and delineate fundamental limitations when adversaries can tailor variance after observing the decision sets. The findings have implications for designing variance-aware algorithms (like SAVE) and for understanding the inherent difficulty of learning under heteroscedastic rewards.
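The distinction between the weak and strong adversaries comes down to the order of events within a round. The sketch below is an illustrative simulation of that ordering, not the paper's lower-bound construction. All names are hypothetical: `run_bandit`, `choose_variance`, `choose_decision_set`, and the placeholder `learner`. It also assumes Gaussian noise, so read it as a description of the interaction protocol under those assumptions. In weak mode, $\sigma_k^2$ is committed using only the history, and $\mathcal{D}_k$ is generated afterwards. In strong mode, the variance can react to $\mathcal{D}_k$. A prefixed sequence is the special case where `choose_variance` ignores its inputs.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def run_bandit(theta, K, choose_variance, choose_decision_set, learner,
               strong_adversary=False, seed=0):
    """Hypothetical sketch: K rounds of a linear contextual bandit whose noise
    in round k has variance sigma_k^2 (Gaussian noise is an assumption).
    Returns (pseudo-regret, history)."""
    rng = random.Random(seed)
    history = []  # (D_k, x_k, r_k) for each past round
    regret = 0.0
    for k in range(K):
        if strong_adversary:
            # Strong adversary: sees the decision set before fixing sigma_k^2.
            D_k = choose_decision_set(k, history)
            var_k = choose_variance(k, history, D_k)
        else:
            # Weak adversary: sigma_k^2 depends on past observations only
            # and is committed before D_k is generated.
            var_k = choose_variance(k, history, None)
            D_k = choose_decision_set(k, history)
        x_k = learner(D_k, history)
        r_k = dot(theta, x_k) + rng.gauss(0.0, math.sqrt(var_k))
        # Pseudo-regret: gap between the best mean reward in D_k and the chosen one.
        regret += max(dot(theta, x) for x in D_k) - dot(theta, x_k)
        history.append((D_k, x_k, r_k))
    return regret, history


# Prefixed sequence: the variances ignore the history entirely.
sigmas2 = [0.25] * 5
regret, hist = run_bandit(
    theta=(1.0, 0.0), K=5,
    choose_variance=lambda k, h, D: sigmas2[k],
    choose_decision_set=lambda k, h: [(1.0, 0.0), (0.0, 1.0)],
    learner=lambda D, h: D[1],  # hypothetical learner that always plays the second arm
)
print(regret)  # 5.0: each round loses 1.0 against the best arm
```

The learner here is deliberately trivial, so the loop only shows who observes what and when. Any variance-aware algorithm would replace the `learner` callable.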

Abstract

Variance-dependent regret bounds for linear contextual bandits improve upon the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}(d\sqrt{\sum_{k=1}^K\sigma_k^2})$, where $d$ is the context dimension, $K$ is the number of rounds, and $\sigma^2_k$ is the noise variance in round $k$. Such bounds have been widely studied in recent years. However, most existing works focus on regret upper bounds rather than lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024). They proved that for any eluder dimension $d_{\textbf{elu}}$ and total variance budget $\Lambda$, there exists an instance with $\sum_{k=1}^K\sigma_k^2\leq \Lambda$ for which any algorithm incurs a variance-dependent lower bound of $\Omega(\sqrt{d_{\textbf{elu}}\Lambda})$. However, this lower bound has a $\sqrt{d}$ gap with existing upper bounds. Moreover, it only considers a fixed total variance budget $\Lambda$ and does not apply to a general variance sequence $\{\sigma_1^2,\ldots,\sigma_K^2\}$. In this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. The first is a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process. For this setting, we establish a variance-dependent lower bound of $\Omega(d \sqrt{\sum_{k=1}^K\sigma_k^2 }/\log K)$ for linear contextual bandits. The second is an adaptive sequence, where an adversary can generate the variance $\sigma_k^2$ in each round $k$ based on historical observations. We show that when the adversary must generate $\sigma_k^2$ before observing the decision set $\mathcal{D}_k$, a similar lower bound of $\Omega(d\sqrt{ \sum_{k=1}^K\sigma_k^2} /\log^6(dK))$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al., 2023) up to logarithmic factors.
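To see the sizes of these rates side by side, the sketch below evaluates each one for a concrete variance sequence. All constants are dropped, so only ratios between entries are meaningful. The function name `regret_orders` is hypothetical. It assumes natural logarithms, and it treats Jia et al.'s eluder dimension $d_{\textbf{elu}}$ as $d$ for the linear class. That substitution is an illustrative assumption, not a statement from the abstract. With it, the ratio $d\sqrt{\Lambda}/\sqrt{d\Lambda}=\sqrt{d}$ reproduces the gap noted above. The prefixed-sequence lower bound, by contrast, differs from the variance-dependent upper rate only by the $\log K$ factor.

```python
import math


def regret_orders(d, variances):
    """Hypothetical helper: orders of the bounds discussed above, constants dropped.
    Natural log assumed; only ratios between entries are meaningful."""
    K = len(variances)
    lam = sum(variances)  # total variance sum_k sigma_k^2 (plays the role of Lambda)
    return {
        "classical_upper": d * math.sqrt(K),                  # O~(d sqrt(K))
        "variance_upper": d * math.sqrt(lam),                 # O~(d sqrt(sum_k sigma_k^2)), e.g. SAVE
        "jia_lower": math.sqrt(d * lam),                      # Omega(sqrt(d_elu Lambda)), d_elu = d assumed
        "prefixed_lower": d * math.sqrt(lam) / math.log(K),   # Omega(d sqrt(sum_k sigma_k^2) / log K)
    }


b = regret_orders(d=16, variances=[0.01] * 10_000)
print(b["classical_upper"] / b["variance_upper"])  # about 10: low noise helps
print(b["variance_upper"] / b["jia_lower"])        # about 4 = sqrt(d)
print(b["variance_upper"] / b["prefixed_lower"])   # about 9.21 = ln(10_000)
```

The adaptive-sequence bound would divide by $\log^6(dK)$ instead of $\log K$. At small $K$ that factor dominates, which is why such bounds are read as rates up to logarithmic factors.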
