Optimal Regret for Policy Optimization in Contextual Bandits
Orin Levy, Yishay Mansour
TL;DR
This work tackles regret minimization for stochastic contextual multi-armed bandits (CMABs) under general offline function approximation by introducing OPO-CMAB, a policy-optimization (PO) algorithm built on an optimistic exploration framework. The method augments PO updates with counterfactual exploration bonuses and relies on an offline regression oracle over a realizable loss class $\mathcal{F}$ to predict losses. The authors prove a high-probability regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, matching the minimax rate up to polylogarithmic factors; the analysis combines the Azuma-Hoeffding inequality, uniform convergence of offline least-squares regression, and Online Mirror Descent with a KL-divergence regularizer. Empirically, OPO-CMAB is competitive with state-of-the-art CMAB baselines on the VW benchmark, demonstrating practical viability alongside rigorous guarantees. The work bridges theory and practice by showing that policy-optimization methods can attain provably optimal regret in CMAB settings with offline function approximation, and it suggests directions for scalable extensions to broader contextual RL problems.
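To make the policy-optimization step concrete, the following is a minimal sketch of one Online Mirror Descent update with a KL regularizer over the arm simplex, which has the familiar exponentiated-gradient closed form. It is an illustration under stated assumptions, not the authors' exact algorithm: the function name `omd_kl_step`, the step size `eta`, and the way the bonus is subtracted from the predicted loss are all hypothetical choices, and the update is shown for a single context only.

```python
import math


def omd_kl_step(policy, predicted_losses, bonuses, eta):
    """One OMD step with a KL regularizer on the probability simplex.

    Closed form (exponentiated gradient):
        p'(a) is proportional to p(a) * exp(-eta * (loss(a) - bonus(a))).
    All argument names are illustrative assumptions, not taken from the paper.
    """
    # Optimistic loss: subtracting a bonus makes under-explored arms look better.
    optimistic = [l - b for l, b in zip(predicted_losses, bonuses)]
    # Shift by the minimum for numerical stability; it cancels on normalization.
    shift = min(optimistic)
    weights = [p * math.exp(-eta * (o - shift)) for p, o in zip(policy, optimistic)]
    total = sum(weights)
    return [w / total for w in weights]


# Example: three arms, uniform starting policy, no exploration bonus.
uniform = [1 / 3, 1 / 3, 1 / 3]
updated = omd_kl_step(uniform, [0.9, 0.5, 0.1], [0.0, 0.0, 0.0], eta=1.0)
```

In this sketch the arm with the lowest optimistic loss gains probability mass after the update, while every arm keeps nonzero probability. In the setting described above, the predicted losses would come from the offline regression oracle fitted on past rounds.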
Abstract
We present the first high-probability optimal regret bound for a policy optimization technique applied to the stochastic contextual multi-armed bandit (CMAB) problem with general offline function approximation. Our algorithm is efficient and achieves an optimal regret bound of $\widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, where $K$ is the number of rounds, $\mathcal{A}$ is the set of arms, and $\mathcal{F}$ is the function class used to approximate the losses. Our results bridge the gap between theory and practice, demonstrating that the policy optimization methods widely used for the contextual bandit problem can achieve a rigorously proved optimal regret bound. We support our theoretical results with an empirical evaluation of our algorithm.
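For orientation, one common way to formalize the regret in this setting is sketched below. The notation is an assumption for illustration and may differ from the paper's: $c_k$ denotes the context in round $k$, $\pi_k$ the policy played in that round, and $f_\star \in \mathcal{F}$ the true expected loss function (realizability).

$$
\mathcal{R}_K \;=\; \sum_{k=1}^{K} \Big( \mathbb{E}_{a \sim \pi_k(\cdot \mid c_k)}\big[ f_\star(c_k, a) \big] \;-\; \min_{a \in \mathcal{A}} f_\star(c_k, a) \Big)
$$

Under this reading, the stated result says that, with high probability, $\mathcal{R}_K \le \widetilde{O}(\sqrt{K|\mathcal{A}|\log|\mathcal{F}|})$, so the average per-round regret vanishes at rate $\widetilde{O}(\sqrt{|\mathcal{A}|\log|\mathcal{F}|/K})$.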
