Optimal Regret for Policy Optimization in Contextual Bandits

Orin Levy; Yishay Mansour

TL;DR

This work tackles regret minimization for stochastic contextual multi-armed bandits (CMABs) under general offline function approximation by introducing OPO-CMAB, a policy-optimization (PO) algorithm built on optimistic exploration. The method augments PO updates with counterfactual exploration bonuses and relies on an offline regression oracle over a realizable loss class $\mathcal{F}$ to predict losses. The authors prove a high-probability regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, matching the minimax rate up to polylogarithmic factors. The regret analysis combines the Azuma-Hoeffding inequality, uniform convergence of offline least-squares regression, and Online Mirror Descent with a KL-divergence regularizer. Empirically, OPO-CMAB is competitive with state-of-the-art CMAB baselines on the VW benchmark. The work bridges theory and practice by showing that policy-optimization methods can attain provably optimal regret in CMAB settings with offline function approximation, and it suggests directions for scalable extensions to broader contextual RL problems.
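The Online Mirror Descent step with a KL-divergence regularizer mentioned above has a well-known closed form on the probability simplex: a multiplicative-weights update. The sketch below illustrates that update for a single context, applied to an optimistic loss (predicted loss minus bonus). It is not the authors' implementation: the function name `omd_kl_update`, the toy numbers, and in particular the bonus values are assumptions for illustration, since this summary does not give the paper's exact bonus definition.

```python
import numpy as np

def omd_kl_update(policy, predicted_loss, bonus, eta):
    """One Online Mirror Descent step with a KL regularizer on the simplex.

    policy, predicted_loss, bonus: arrays of shape (num_arms,) for one context.
    policy must have strictly positive entries.
    Closed form (multiplicative weights):
        new(a) ∝ policy(a) * exp(-eta * (predicted_loss(a) - bonus(a))).
    """
    # Optimism: subtracting the bonus makes under-explored arms look better.
    optimistic_loss = predicted_loss - bonus
    logits = np.log(policy) - eta * optimistic_loss
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical toy inputs for a 3-arm problem.
policy = np.full(3, 1.0 / 3.0)             # start from the uniform policy
predicted_loss = np.array([0.9, 0.2, 0.5]) # oracle's loss predictions (assumed)
bonus = np.array([0.0, 0.0, 0.2])          # exploration bonuses (assumed)
new_policy = omd_kl_update(policy, predicted_loss, bonus, eta=1.0)
```

With these inputs the update shifts probability mass toward the arm with the lowest optimistic loss while keeping every arm's probability positive, which is the behavior the KL regularizer is chosen for.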

Abstract

We present the first high-probability optimal regret bound for a policy optimization technique applied to the problem of stochastic contextual multi-armed bandit (CMAB) with general offline function approximation. Our algorithm is both efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{ K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the widely used policy optimization methods for the contextual bandit problem can achieve a rigorously-proved optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
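For a finite function class $\mathcal{F}$, an offline least-squares regression oracle can be pictured as choosing the function with the smallest squared error on the logged data. The following is a minimal sketch under that assumption; the helper `offline_ls_oracle` and the toy class of loss models are hypothetical and only illustrate the oracle interface, not the paper's experimental setup.

```python
import numpy as np

def offline_ls_oracle(function_class, contexts, actions, losses):
    """Return the function in a finite class with the smallest squared error."""
    def squared_error(f):
        preds = np.array([f(x, a) for x, a in zip(contexts, actions)])
        return float(np.sum((preds - losses) ** 2))
    return min(function_class, key=squared_error)

# Hypothetical toy class: loss models of the form f(x, a) = |x - c * a|.
function_class = [lambda x, a, c=c: abs(x - c * a) for c in (0.0, 0.5, 1.0)]

# Logged (context, action, loss) triples generated by the c = 0.5 model,
# so the class is realizable on this data.
contexts = np.array([0.0, 0.5, 1.0, 1.0])
actions = np.array([0, 1, 1, 2])
losses = np.array([abs(x - 0.5 * a) for x, a in zip(contexts, actions)])

best = offline_ls_oracle(function_class, contexts, actions, losses)
```

Because the data are generated by a member of the class, the oracle recovers that member exactly; with noisy losses it would return the closest fit in squared error instead.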

Paper Structure (20 sections, 16 theorems, 52 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
Related Literature Review
Preliminaries and Notations
Offline Function Approximation
Algorithm and Main Result
Regret Analysis
Experiments
Conclusions and Discussion
Proofs: Regret Analysis
Auxiliary Lemmas
Online Mirror Descent
Concentration Inequalities
Oracle Convergence
Additional Algebraic Lemmas
Experiments
...and 5 more sections

Key Result

Theorem 3.1

For an appropriate choice of $\beta, \eta$, with probability at least $1-\delta$, the regret of OPO-CMAB is bounded by $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$.
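To get a feel for the rate $\sqrt{K|\mathcal{A}|\log|\mathcal{F}|}$ quoted in the abstract, the short sketch below evaluates its leading term for some made-up problem sizes. Constants and the polylogarithmic factors hidden by $\widetilde{O}$ are ignored, so the numbers indicate scaling only; the function name and the chosen sizes are assumptions.

```python
import math

def regret_rate(num_rounds, num_arms, class_size):
    """Leading term sqrt(K * |A| * log|F|), ignoring constants and polylog factors."""
    return math.sqrt(num_rounds * num_arms * math.log(class_size))

# Hypothetical sizes: K = 10,000 rounds, |A| = 10 arms, |F| = 1,000 functions.
rate = regret_rate(num_rounds=10_000, num_arms=10, class_size=1_000)
per_round = rate / 10_000  # average regret per round under this rate

# Quadrupling the number of rounds doubles the leading term.
ratio = regret_rate(40_000, 10, 1_000) / rate
```

The square-root dependence on $K$ means the average per-round regret shrinks as the horizon grows, while the dependence on the function class is only logarithmic in $|\mathcal{F}|$.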

Figures (2)

Figure 1: Averaged PV loss in Datasets 1084, 1062, 1015.
Figure 2: Mean Difference from Supervised baseline.

Theorems & Definitions (27)

Theorem 3.1: Regret bound
Corollary 4.1: uniform convergence of offline least-squares regression
Lemma 4.2: Bonuses bound
Lemma 4.3: Term (i) bound
Proof: Proof sketch
Lemma 4.4: Term (ii) bound
Proof: Proof sketch
Lemma 4.5: Term (iii) bound
Proof: Proof of Theorem 3.1
Lemma 1.1: Bonuses bound, restatement of Lemma 4.2
...and 17 more
