Second Order Bounds for Contextual Bandits with Function Approximation
Aldo Pacchiano
TL;DR
This paper tackles contextual bandits with function approximation under mean-reward realizability, aiming for regret that scales with the total variance of the observation noise rather than with the time horizon. It develops optimistic least-squares procedures augmented with variance-aware, multi-bucket filtering to achieve second-order bounds in both the known-variance and unknown-variance regimes. By introducing online variance estimation and uncertainty filtering, the paper shows regret bounds that depend on the eluder dimension and the cumulative variance, bridging gaps left by prior results that require the variances to be observed or require realizability assumptions beyond the mean reward. The results have practical implications for efficiently leveraging variance information in adaptive decision-making and point toward extensions to reinforcement learning settings. The key contributions are a variance-estimation framework and a thresholded regression component for the unknown-variance case, which together enable robust performance under heteroscedastic rewards.
Abstract
Many works have developed no-regret algorithms for contextual bandits with function approximation, where the mean reward function over context-action pairs belongs to a function class. Although there are many approaches to this problem, one that has gained in importance is the use of algorithms based on the optimism principle, such as optimistic least squares. It can be shown that the regret of this algorithm scales as the square root of the product of the eluder dimension (a statistical measure of the complexity of the function class), the logarithm of the function class size, and the time horizon. Unfortunately, even if the variance of the measurement noise of the rewards changes over time and is very small, the regret of the optimistic least squares algorithm still scales with the square root of the time horizon. In this work we are the first to develop algorithms whose regret bounds scale not with the square root of the time horizon but with the square root of the sum of the measurement variances, in the setting of contextual bandits with function approximation when the variances are unknown. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems.
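The two scalings contrasted in the abstract can be written side by side. The notation here is an assumption introduced for illustration and is not taken from the paper: $T$ is the time horizon, $\mathcal{F}$ the function class, $d_{\mathrm{eluder}}$ its eluder dimension, $\sigma_t^2$ the variance of the reward noise at round $t$, and $\widetilde{O}$ hides constants and logarithmic factors. The second line records only the variance dependence stated in the abstract; the exact dependence on $d_{\mathrm{eluder}}$ and $\log|\mathcal{F}|$ is left to the paper's theorems.

```latex
% First-order scaling of optimistic least squares, as described in the abstract
\mathrm{Regret}(T) \;=\; \widetilde{O}\!\left( \sqrt{ d_{\mathrm{eluder}} \cdot \log|\mathcal{F}| \cdot T } \right)

% Second-order scaling targeted in this work (variance dependence only)
\mathrm{Regret}(T) \;\text{ scales with }\; \sqrt{ \sum_{t=1}^{T} \sigma_t^2 }

% Why this helps when the noise is small: if \sigma_t^2 \le \sigma^2 for all t, then
\sqrt{ \sum_{t=1}^{T} \sigma_t^2 } \;\le\; \sqrt{ \sigma^2 T } \;=\; \sigma \sqrt{T}
```

For example, if every round has noise variance at most $\sigma^2 = 0.01$, the variance term is at most $0.1\sqrt{T}$, whereas the first-order scaling keeps the full $\sqrt{T}$ factor regardless of how small the noise is.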
