Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback
Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov
TL;DR
This work studies reinforcement learning (RL) with aggregate bandit feedback (ABF) in linear MDPs, where the agent observes only the sum of rewards at the end of each episode. It introduces two algorithms, RE-LSVI and REPO, both built on an ensemble of perturbed Q-functions: RE-LSVI obtains near-optimal regret through optimism, and REPO through hedging-based policy optimization (PO). The main contributions are a novel ensemble randomization technique with loose truncation that keeps estimates bounded and bias-free, and the first PO approach for ABF. The regret bounds are roughly $\tilde{O}(\sqrt{d^5 H^7 K})$ for RE-LSVI and $\tilde{O}(\sqrt{d^5 H^9 K})$ for REPO (up to problem-dependent factors), where $d$ is the feature dimension, $H$ the horizon, and $K$ the number of episodes. These results extend ABF from the tabular setting to linear function approximation, giving theoretical guarantees for settings where per-step rewards are unavailable.
Abstract
In many real-world applications, it is hard to provide a reward signal at each step of a Reinforcement Learning (RL) process and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work studied RL-ABF only in tabular settings, where the number of states is assumed to be small. In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with an ensemble of Q-functions, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble.
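To make the feedback model concrete, the following is a minimal sketch of why aggregate feedback can still be informative when rewards are linear in known features. It is not RE-LSVI or REPO. It assumes a hypothetical linear reward $r_h = \phi_h^\top \theta_h$ at each step, replaces real trajectories with random Gaussian features, and uses plain ridge regression on the concatenated per-episode features. All function and parameter names are illustrative.

```python
import numpy as np


def simulate_aggregate_feedback(num_episodes=500, horizon=4, dim=3,
                                noise_std=0.05, seed=0):
    """Generate episodes where only the sum of per-step rewards is observed."""
    rng = np.random.default_rng(seed)
    # Hypothetical per-step reward parameters theta_h (unknown to the learner).
    theta = rng.uniform(-1.0, 1.0, size=(horizon, dim))
    # Assumption: random features stand in for phi(s_h, a_h) along a trajectory.
    features = rng.normal(size=(num_episodes, horizon, dim))
    # Per-step rewards phi_h^T theta_h; these are never shown to the learner.
    per_step_rewards = np.einsum("khd,hd->kh", features, theta)
    # Aggregate bandit feedback: one noisy scalar per episode.
    noise = noise_std * rng.normal(size=num_episodes)
    aggregate = per_step_rewards.sum(axis=1) + noise
    return theta, features, aggregate


def fit_from_aggregate(features, aggregate, reg=1e-3):
    """Ridge regression of the episode return on the concatenated features."""
    num_episodes, horizon, dim = features.shape
    # The return is linear in the stacked vector (phi_1, ..., phi_H).
    stacked = features.reshape(num_episodes, horizon * dim)
    gram = stacked.T @ stacked + reg * np.eye(horizon * dim)
    theta_hat = np.linalg.solve(gram, stacked.T @ aggregate)
    return theta_hat.reshape(horizon, dim)


theta, features, aggregate = simulate_aggregate_feedback()
theta_hat = fit_from_aggregate(features, aggregate)
max_error = float(np.max(np.abs(theta_hat - theta)))
print(f"largest parameter error: {max_error:.4f}")
```

The regression target is a single number per episode, so the effective regression dimension grows with the horizon (here `horizon * dim`). The sketch covers only reward estimation; it does not address exploration or the transition dynamics that the algorithms in the paper handle.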
