Second Order Bounds for Contextual Bandits with Function Approximation

Aldo Pacchiano

TL;DR

This paper studies contextual bandits with function approximation under mean-reward realizability, aiming for regret that scales with the total observation noise rather than with the time horizon. It develops optimistic least-squares procedures augmented with variance-aware, multi-bucket filtering to achieve second-order bounds in both the known-variance and unknown-variance regimes. Using online variance estimation and uncertainty filtering, the paper proves regret bounds that depend on the eluder dimension and the cumulative variance, closing gaps left by prior results that require the variances to be observed or realizability assumptions beyond the mean reward. The results show how variance information can be exploited in adaptive decision-making and point toward extensions to reinforcement learning. The key technical components are a variance-estimation framework and a thresholded regression procedure for unknown variances, which together handle heteroscedastic rewards.
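
To make the contrast concrete, the two scalings can be written side by side. The first line is the standard rate for optimistic least squares; the second shows only the dependence on the noise level that a second-order bound targets. The symbols $d_{\mathrm{eluder}}$ (eluder dimension), $\mathcal{F}$ (function class), and $\sigma_t^2$ (variance of the reward noise at round $t$) are notation assumed for this sketch, and the exact dependence of the second-order bound on the eluder dimension and on logarithmic factors is deliberately left unspecified.

```latex
\begin{align*}
\text{first order:}\quad
  & \mathrm{Regret}(T) = \widetilde{\mathcal{O}}\Big(\sqrt{d_{\mathrm{eluder}}\,\log|\mathcal{F}|\;T}\Big), \\
\text{second order:}\quad
  & \mathrm{Regret}(T)\ \text{grows with}\ \sqrt{\textstyle\sum_{t=1}^{T}\sigma_t^2}\ \text{rather than}\ \sqrt{T}.
\end{align*}
```

If every $\sigma_t^2 \le \sigma^2$, then $\sqrt{\sum_{t=1}^{T}\sigma_t^2} \le \sigma\sqrt{T}$, so the second-order quantity is small whenever the noise is small, even for large $T$.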

Abstract

Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean reward function over context-action pairs belongs to a function class. Among the many approaches to this problem, algorithms based on the optimism principle, such as optimistic least squares, have gained in importance. It can be shown that the regret of this algorithm scales as the square root of the product of the eluder dimension (a statistical measure of the complexity of the function class), the logarithm of the function class size, and the time horizon. Unfortunately, even if the variance of the measurement noise of the rewards changes over time and is very small, the regret of the optimistic least squares algorithm still scales with the square root of the time horizon. In this work we are the first to develop algorithms whose regret bounds scale not with the square root of the time horizon but with the square root of the sum of the measurement variances, in the setting of contextual bandits with function approximation when the variances are unknown. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems.
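
The optimistic least-squares baseline mentioned above can be sketched for a finite function class as follows. This is a minimal illustration of the optimism principle only, not the paper's variance-aware algorithm: the function name `optimistic_least_squares`, the `reward_fn` environment hook, and the hand-picked constant confidence radius `beta` are all assumptions made for the sketch.

```python
import random


def optimistic_least_squares(function_class, contexts, actions, reward_fn, beta, seed=0):
    """Plain (variance-unaware) optimistic least squares over a finite class.

    function_class: list of callables f(context, action) -> predicted mean reward
    reward_fn: hypothetical environment hook, reward_fn(context, action, rng) -> reward
    beta: confidence radius, treated here as a user-chosen constant (an assumption)
    """
    rng = random.Random(seed)
    history = []  # (context, action, observed reward) triples
    chosen = []
    for x in contexts:
        # Least-squares fit: the function with the smallest squared error so far.
        def squared_error(f):
            return sum((f(c, a) - r) ** 2 for c, a, r in history)

        f_hat = min(function_class, key=squared_error)

        # Confidence set: functions that stay close to the fit on past data.
        confidence_set = [
            f for f in function_class
            if sum((f(c, a) - f_hat(c, a)) ** 2 for c, a, _ in history) <= beta
        ]

        # Optimism: play the action with the largest plausible mean reward.
        action = max(actions, key=lambda a: max(f(x, a) for f in confidence_set))

        history.append((x, action, reward_fn(x, action, rng)))
        chosen.append(action)
    return chosen


if __name__ == "__main__":
    # Toy problem: two candidate reward models; the second one is the truth.
    def model_a(context, action):
        return 1.0 if action == 0 else 0.0

    def model_b(context, action):
        return 0.0 if action == 0 else 1.0

    def noiseless_reward(context, action, rng):
        return model_b(context, action)

    print(optimistic_least_squares([model_a, model_b], [None] * 5, [0, 1],
                                   noiseless_reward, beta=0.5))
```

In this toy run a single noiseless observation is enough to eliminate the wrong model, after which the learner keeps playing the better action. A fixed `beta` is used only to keep the sketch short; in analyses of this kind the radius is typically tied to the function class size and the confidence level.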

Paper Structure (18 sections, 38 theorems, 149 equations, 3 algorithms)

Introduction
Contributions.
Problem Definition
Optimistic Least Squares
Second Order Optimistic Least Squares with Known Variance
Contextual Bandits with Unknown Variance
Variance Estimation in Contextual Bandit Problems
Unknown-Variance Guarantees for Algorithm \ref{alg:contextual_known_variance}
Unknown-Variance Dependent Least Squares Regression
Conclusion
Supporting Results
Proofs of Section \ref{section::optimistic_least_squares}
Proofs of Section \ref{section::second_order_known_variance}
Proofs of Section \ref{section::unknown_variance_contextual}
Proofs of Section \ref{section::estimating_variance_contextual}
...and 3 more sections

Key Result

Theorem 2.1

Let $\delta \in (0,1)$. There exists an algorithm that achieves a regret rate of, for all $T \in \mathbb{N}$ with probability at least $1-\delta$, where $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic dependencies.

Theorems & Definitions (61)

Definition 2.1
Definition 2.2
Theorem 2.1: Simplified
Lemma 3.0
Corollary 3.1
proof
Lemma 3.2
Theorem 3.3
proof
Proposition 4.0
...and 51 more
