Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits

Hao Qin; Chicheng Zhang

TL;DR

This paper introduces OE2D, a unified offline-regression-to-decision framework for contextual bandits with general reward-function classes. It achieves near-optimal regret with $O(\log T)$ offline regression calls, or $O(\log\log T)$ calls when $T$ is known. A central novelty is the Decision-Offline Estimation Coefficient (DOEC), a complexity measure that quantifies the estimation burden required to reduce online learning to offline estimation, together with its tight relationship to the $\varepsilon$-Sequential Extrapolation Coefficient ($\varepsilon$-SEC) and to the Decision Estimation Coefficient (DEC). The algorithm employs an exploitative F-design to balance exploitation and coverage, so that the resulting regret scales favorably in large action spaces and under misspecification, corruption, or distribution shifts. Structural results tie DOEC to the Eluder dimension and to $h$-smoothed regret, showing that a small DOEC leads to sublinear regret and bridging offline- and online-oracle-based approaches. Overall, OE2D unifies the offline and online perspectives, improves oracle-call efficiency, and provides robust guarantees across a range of contextual bandit settings with general function classes.
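To make the oracle-call counts concrete, the sketch below compares two epoch schedules in which the offline oracle is called once per epoch. It assumes a doubling schedule for unknown $T$ and the schedule $\tau_m = \lceil T^{1-2^{-m}} \rceil$ for known $T$, as in earlier epoch-based algorithms such as Falcon; whether OE2D uses exactly these schedules is an assumption, and the function names are hypothetical.

```python
import math


def doubling_epoch_ends(T):
    """Epoch end times 2, 4, 8, ... capped at T (no knowledge of T needed to grow)."""
    ends = []
    tau = 2
    while tau < T:
        ends.append(tau)
        tau *= 2
    ends.append(T)  # the last epoch runs until the horizon
    return ends


def known_horizon_epoch_ends(T):
    """Epoch end times ceil(T ** (1 - 2 ** -m)), m = 1, 2, ..., then a final epoch to T.

    Assumed schedule (illustrative): once an epoch end reaches T / 2,
    a single final epoch covers the remaining rounds.
    """
    ends = []
    m = 1
    while True:
        tau = math.ceil(T ** (1.0 - 2.0 ** (-m)))
        if tau >= T / 2:
            break
        ends.append(tau)
        m += 1
    ends.append(T)
    return ends


if __name__ == "__main__":
    T = 1_000_000
    # One oracle call per epoch: about log2(T) versus about log2(log2(T)) calls.
    print(len(doubling_epoch_ends(T)), len(known_horizon_epoch_ends(T)))
```

With one oracle call per epoch, the first schedule makes on the order of $\log T$ calls and the second on the order of $\log\log T$, which matches the orders quoted above.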

Abstract

We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework achieves near-optimal regret for contextual bandits with large action spaces using $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and $O(\log\log T)$ calls when $T$ is known. The design of the OE2D algorithm generalizes Falcon~\citep{simchi2022bypassing} and its linear reward version~\citep[][Section 4]{xu2020upper}: it chooses an action distribution, which we term an ``exploitative F-design'', that simultaneously guarantees low regret and good coverage, thereby trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in settings with bounded per-context Eluder dimension and in smoothed-regret settings. We also establish a relationship between DOEC and the Decision Estimation Coefficient (DEC)~\citep{foster2021statistical}, bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
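For intuition about an action distribution that balances low regret against coverage, here is a minimal sketch of the inverse-gap-weighting rule that Falcon uses in the finite-action setting. It is not OE2D's exploitative F-design, which this excerpt does not specify; the function name, the argument names, and the learning-rate parameter `gamma` are illustrative assumptions.

```python
def inverse_gap_weighting(predicted_rewards, gamma):
    """Action distribution from predicted rewards (finite-action, Falcon-style rule).

    Each non-greedy action a gets probability 1 / (K + gamma * gap(a)), where
    gap(a) is the predicted reward gap to the greedy action; the greedy action
    receives the remaining mass. A larger gamma means more exploitation.
    """
    K = len(predicted_rewards)
    greedy = max(range(K), key=lambda a: predicted_rewards[a])
    best = predicted_rewards[greedy]
    probs = [0.0] * K
    for a in range(K):
        if a != greedy:
            probs[a] = 1.0 / (K + gamma * (best - predicted_rewards[a]))
    # Non-greedy mass is at most (K - 1) / K, so the greedy action gets at least 1 / K.
    probs[greedy] = 1.0 - sum(probs)
    return probs


if __name__ == "__main__":
    print(inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0))
```

Every action keeps nonzero probability (coverage), while actions with a large predicted gap are played rarely (low regret), which is the trade-off the abstract describes in general form.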

Paper Structure

This paper contains 55 sections, 28 theorems, 112 equations, 1 table, 4 algorithms.

Introduction
Related Work
Contextual Bandits
Contextual Bandits with Online Regression Oracles
Contextual Bandits with Offline Regression Oracles
Experimental Design
Preliminaries
Basic Notations
Basic Assumptions
Main Performance Measure: Regret
Running Examples
Regression Oracles
Decision-Estimation Coefficient
Coverage between Distributions
The $\textsc{OE2D}$ contextual bandit algorithm and its guarantees
...and 40 more sections

Key Result

Lemma 1

$p_t \in \mathrm{co}(\Lambda)$ satisfies the following two properties simultaneously:

Theorems & Definitions (36)

Definition 1: DEC
Definition 2: DOEC
Lemma 1
Theorem 1
Remark 2: OE2D with inexact minimizers
Definition 3: $\varepsilon$-SEC
Theorem 3
Proposition 1
Proposition 2
Theorem 4
...and 26 more
