Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits
Hao Qin, Chicheng Zhang
TL;DR
This paper introduces OE2D, a unified offline-regression-to-decision framework for contextual bandits with general reward-function classes, achieving near-optimal regret with $O(\log T)$ offline regression calls (and $O(\log\log T)$ when $T$ is known). A central novelty is the Decision-Offline Estimation Coefficient (DOEC), a complexity measure that quantifies the estimation burden required to reduce online learning to offline estimation; the paper establishes its tight relationship to the $\epsilon$-Sequential Extrapolation Coefficient ($\epsilon$-SEC) and to the Decision Estimation Coefficient (DEC). The algorithm employs an exploitative F-design to balance exploitation and coverage, so that the resulting regret scales favorably in large action spaces and under misspecification, corruption, or distribution shift. Structural results tie DOEC to the Eluder dimension and to $h$-smoothed regret, showing that a small DOEC leads to sublinear regret and bridging offline- and online-oracle-based approaches. Overall, OE2D unifies offline and online perspectives, improves oracle-call efficiency, and provides robust guarantees across a range of contextual bandit settings with general function classes.
Abstract
We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework achieves near-optimal regret for contextual bandits with large action spaces using $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and only $O(\log\log T)$ calls when $T$ is known. The design of the OE2D algorithm generalizes Falcon~\citep{simchi2022bypassing} and its linear reward version~\citep[][Section 4]{xu2020upper}: it chooses an action distribution, which we term an ``exploitative F-design'', that simultaneously guarantees low regret and good coverage, thereby trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded per-context Eluder dimension setting and in the smoothed regret setting. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC)~\citep{foster2021statistical}, bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
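To make the two oracle-call counts concrete, the sketch below counts epochs under two schedules, assuming one offline regression call per epoch. The abstract does not state OE2D's actual schedule, so both schedules here are assumptions borrowed from the epoch-based style of Falcon: a doubling schedule $\tau_m = 2^m$ when $T$ is unknown, and $\tau_m = \lfloor 2\,T^{1-2^{-m}} \rfloor$ when $T$ is known. The function names are hypothetical.

```python
import math


def doubling_epoch_ends(T):
    """Assumed schedule for unknown T: epoch m ends at round 2^m (capped at T)."""
    ends, m = [], 1
    while True:
        tau = min(2 ** m, T)
        ends.append(tau)
        if tau >= T:
            return ends
        m += 1


def known_horizon_epoch_ends(T):
    """Assumed schedule for known T: epoch m ends at floor(2 * T^(1 - 2^-m)), capped at T."""
    ends, m = [], 1
    while True:
        tau = min(math.floor(2 * T ** (1.0 - 2.0 ** (-m))), T)
        # Skip duplicate end points, which can occur for very small T.
        if not ends or tau > ends[-1]:
            ends.append(tau)
        if tau >= T:
            return ends
        m += 1


if __name__ == "__main__":
    T = 1_000_000
    # One oracle call per epoch, so the number of epochs is the number of calls.
    print("doubling schedule calls:", len(doubling_epoch_ends(T)))
    print("known-horizon schedule calls:", len(known_horizon_epoch_ends(T)))
```

Under these assumed schedules, the first count grows like $\log_2 T$ and the second like $\log_2\log_2 T$, which matches the orders of growth quoted in the abstract; the constants and exact epoch boundaries used by OE2D may differ.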
