Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
Orin Levy, Liad Erez, Alon Cohen, Yishay Mansour
TL;DR
The paper studies regret minimization for adversarial contextual multi-armed bandits (CMAB) over $K$ actions when feedback arrives with adversarially chosen delays. It analyzes two frameworks: (i) policy-class learning with a finite policy set $\Pi$ and (ii) online function approximation with a realizable loss class $\mathcal{F}$ accessed via an online regression oracle $\mathcal{O}$. It introduces EXP4-DALE for the policy-class setting, achieving a near-optimal regret of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$, where $D$ is the sum of delays. For function approximation it introduces DAFA, which achieves $O(\sqrt{KT\mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ under FIFO delays, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ bounds the oracle's regret, and $\beta$ is the oracle's stability parameter. A Hedge-based Vovk aggregating forecaster is analyzed to provide a stable online regression oracle for finite classes with $\beta = O(\log |\mathcal{F}|)$, leading to a regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$. Complementary lower bounds show that the policy-class bound is optimal up to log factors, while the function-approximation bound is tight up to a $\sqrt{d_{\max}}$ factor, highlighting the role of oracle stability and the FIFO-delivery assumption in delayed CMAB with general function approximation.
Abstract
We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$, we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$, where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-squares regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT\mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret, and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-squares regression over a finite function class $\mathcal{F}$, and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$. This results in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$, which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
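To make the idea of an online regression oracle over a finite class $\mathcal{F}$ concrete, the following is a minimal sketch of a plain exponentially weighted (Hedge-style) average forecaster for square loss. It is a simplification for illustration and not necessarily the exact forecaster analyzed in the paper. The class name `HedgeRegressionOracle`, the learning rate `eta=0.5`, and the three toy predictors are all assumptions introduced here.

```python
import math


class HedgeRegressionOracle:
    """Exponential-weights forecaster over a finite class of loss predictors.

    Hypothetical illustration: each f in `functions` maps (context, action)
    to a predicted loss in [0, 1].
    """

    def __init__(self, functions, eta=0.5):
        self.functions = list(functions)
        self.eta = eta  # illustrative learning rate (an assumption)
        # Log-weights start uniform; working in log space avoids underflow.
        self.log_weights = [0.0] * len(self.functions)

    def weights(self):
        # Normalize the exponentiated log-weights into a distribution.
        m = max(self.log_weights)
        unnorm = [math.exp(lw - m) for lw in self.log_weights]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    def predict(self, context, action):
        # Predict the weighted average of the class's predictions.
        w = self.weights()
        return sum(wi * f(context, action) for wi, f in zip(w, self.functions))

    def update(self, context, action, loss):
        # Penalize each predictor by its squared error on the observed loss.
        # With delayed feedback, this would be called when the loss arrives.
        for i, f in enumerate(self.functions):
            self.log_weights[i] -= self.eta * (f(context, action) - loss) ** 2


# Toy finite class (hypothetical): two constant predictors and one linear one.
functions = [
    lambda x, a: 0.2,
    lambda x, a: 0.8,
    lambda x, a: 0.5 * x,
]

oracle = HedgeRegressionOracle(functions)
initial_prediction = oracle.predict(1.0, 0)

# Feed 50 observations whose true loss matches the second predictor.
for _ in range(50):
    oracle.update(1.0, 0, 0.8)

final_prediction = oracle.predict(1.0, 0)
final_weights = oracle.weights()
```

In this toy run the weight mass concentrates on the predictor that matches the observed losses, so the forecast moves from the uniform average toward 0.8. Each update shifts the weights only gradually, which is the informal intuition behind a stability parameter such as $\beta$.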
