Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Asaf Cassel, Haipeng Luo, Aviv Rosenberg, Dmitry Sotnikov

TL;DR

This work addresses reinforcement learning (RL) under aggregate bandit feedback (ABF) in linear MDPs, where only the sum of rewards is observed at the end of each episode. It introduces two algorithms, RE-LSVI and REPO, that use an ensemble of perturbed Q-functions to obtain near-optimal regret: RE-LSVI through optimism, and REPO through hedging-based policy optimization (PO). The main contributions are a novel ensemble randomization technique with loose truncation to keep estimates bounded and bias-free, and the first PO approach for ABF. The regret is roughly $\tilde{O}(\sqrt{d^5 H^7 K})$ for RE-LSVI and $\tilde{O}(\sqrt{d^5 H^9 K})$ for REPO (up to problem-dependent factors), where $d$ is the feature dimension, $H$ the horizon, and $K$ the number of episodes. These results extend ABF from tabular settings to linear function approximation, giving theoretical guarantees for settings where per-step rewards are unavailable.
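To make the gap between the two quoted rates concrete, the ratio can be worked out directly. This takes the $\tilde{O}$ expressions at face value and ignores logarithmic and problem-dependent factors, so it is only a rough comparison:

$$
\frac{\sqrt{d^5 H^9 K}}{\sqrt{d^5 H^7 K}} = \sqrt{H^2} = H .
$$

Under that reading, the REPO bound is larger than the RE-LSVI bound by about a factor of the horizon $H$, while both share the same dependence on $d$ and $K$.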

Abstract

In many real-world applications, it is hard to provide a reward signal in each step of a Reinforcement Learning (RL) process and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work studied RL-ABF only in tabular settings, where the number of states is assumed to be small. In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with a Q-functions ensemble, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble.
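The feedback model described above can be illustrated with a small sketch. It is not the paper's RE-LSVI or REPO algorithm: it only shows that, when per-step rewards are linear in known features, the episode-end sum is linear in the concatenated features, so a plain ridge regression on episode returns can recover the reward parameters. The function names, the randomly drawn features (no actual MDP dynamics), the Gaussian noise level, and the regularization value are all assumptions made for illustration.

```python
import numpy as np

def simulate_episode(rng, theta, noise_std=0.1):
    """Draw hypothetical per-step features and return them with ONLY the summed reward."""
    horizon, dim = theta.shape
    feats = rng.normal(size=(horizon, dim)) / np.sqrt(dim)
    per_step = np.einsum("hd,hd->h", feats, theta)  # hidden from the learner
    aggregate = per_step.sum() + noise_std * rng.normal()
    return feats, aggregate

def fit_from_aggregate(episodes, reg=1e-3):
    """Ridge regression of episode returns on concatenated per-step features."""
    horizon, dim = episodes[0][0].shape
    X = np.stack([feats.reshape(-1) for feats, _ in episodes])  # shape (K, H*d)
    y = np.array([ret for _, ret in episodes])                  # shape (K,)
    gram = X.T @ X + reg * np.eye(horizon * dim)
    return np.linalg.solve(gram, X.T @ y).reshape(horizon, dim)

rng = np.random.default_rng(0)
H, d, K = 3, 2, 500                                # horizon, feature dim, episodes
theta_true = rng.uniform(-1.0, 1.0, size=(H, d))   # assumed reward parameters
episodes = [simulate_episode(rng, theta_true) for _ in range(K)]
theta_hat = fit_from_aggregate(episodes)
max_err = float(np.abs(theta_hat - theta_true).max())
print(f"max abs error in recovered reward parameters: {max_err:.4f}")
```

The learner here never sees `per_step`, only one scalar per episode. The sketch leaves out the exploration problem entirely, which is where the ensemble of perturbed Q-functions mentioned in the abstract comes in.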

Paper Structure (57 sections, 40 theorems, 149 equations, 4 algorithms)

Key Result

Theorem 1

Suppose that we run RE-LSVI (Algorithm RE-LSVI) with the parameters defined in the good-event lemma (in the RE-LSVI appendix). Then with probability at least $1 - \delta$, we have regret of roughly $\tilde{O}(\sqrt{d^5 H^7 K})$, up to problem-dependent factors.

Theorems & Definitions (66)

  • Theorem 1
  • Theorem 2
  • Lemma 3
  • Proof
  • Lemma 4: Good event
  • Lemma 5
  • Lemma 6: Optimism
  • Proof
  • Lemma 7: Cost of optimism
  • Proof
  • ...and 56 more