Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback

Asaf Cassel; Haipeng Luo; Aviv Rosenberg; Dmitry Sotnikov

TL;DR

This work addresses RL under aggregate feedback in linear MDPs, where only episode-end reward sums are observed. It introduces two algorithms, RE-LSVI and REPO, that leverage an ensemble of perturbed Q-functions to obtain near-optimal regret via optimism (RE-LSVI) and hedging-based policy optimization (REPO). The main contributions are a novel ensemble randomization technique with loose truncation to keep estimates bounded and bias-free, and the first PO approach for ABF, achieving regret of roughly $\tilde{O}(\sqrt{d^5 H^7 K})$ for RE-LSVI and $\tilde{O}(\sqrt{d^5 H^9 K})$ for REPO (up to problem-dependent factors). These results extend ABF from tabular to linear function approximation, enabling scalable ABF RL with theoretical guarantees and practical relevance for settings where per-step rewards are unavailable.
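The idea of drawing optimism from an ensemble of randomly perturbed estimates can be illustrated with a toy sketch. This is not the paper's RE-LSVI: it omits value iteration over the horizon and the loose truncation step. The Gaussian perturbation, the names `perturbed_ensemble` and `optimistic_member`, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np

def perturbed_ensemble(theta_hat, cov_inv, m, scale, rng):
    """Draw m Gaussian perturbations of theta_hat with covariance scale^2 * cov_inv.

    Assumption: Gaussian noise shaped by an inverse covariance matrix; the
    paper's exact perturbation scheme is not reproduced here.
    """
    zero_mean = np.zeros(len(theta_hat))
    noise = rng.multivariate_normal(zero_mean, scale**2 * cov_inv, size=m)
    return theta_hat + noise  # shape (m, d)

def optimistic_member(ensemble, phi):
    """Return the index and value of the member with the largest prediction at phi."""
    values = ensemble @ phi
    best = int(np.argmax(values))
    return best, float(values[best])

rng = np.random.default_rng(0)
d, m = 5, 20                      # hypothetical feature dimension and ensemble size
theta_hat = np.full(d, 0.2)       # hypothetical point estimate
cov_inv = np.eye(d) / 10.0        # hypothetical inverse covariance
ensemble = perturbed_ensemble(theta_hat, cov_inv, m, scale=1.0, rng=rng)

phi = np.ones(d) / np.sqrt(d)     # a unit-norm feature vector
best_index, best_value = optimistic_member(ensemble, phi)
mean_value = float(ensemble.mean(axis=0) @ phi)
```

Taking the maximum over the ensemble always yields a value at least as large as the ensemble average, which is the intuition behind using randomization as a source of optimism. A hedging-based method such as REPO would instead maintain weights over the ensemble members rather than picking the maximizer.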

Abstract

In many real-world applications, it is hard to provide a reward signal in each step of a Reinforcement Learning (RL) process and more natural to give feedback when an episode ends. To this end, we study the recently proposed model of RL with Aggregate Bandit Feedback (RL-ABF), where the agent only observes the sum of rewards at the end of an episode instead of each reward individually. Prior work studied RL-ABF only in tabular settings, where the number of states is assumed to be small. In this paper, we extend ABF to linear function approximation and develop two efficient algorithms with near-optimal regret guarantees: a value-based optimistic algorithm built on a new randomization technique with a Q-functions ensemble, and a policy optimization algorithm that uses a novel hedging scheme over the ensemble.
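To make the feedback model concrete, the following minimal sketch assumes the standard linear reward form $r_h(s,a) = \phi(s,a)^\top \theta_h$, under which the episode's reward sum is linear in the concatenated per-step features. It then recovers the reward parameters by ridge regression from the sums alone. The random unit-norm features, the noise-free rewards, and the helper names `simulate_episode` and `fit_from_aggregate` are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def simulate_episode(rng, theta, horizon, dim):
    """Return per-step features and ONLY the summed reward for one episode."""
    features = rng.normal(size=(horizon, dim))
    features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit-norm features
    per_step_rewards = np.einsum("hd,hd->h", features, theta)    # hidden from the learner
    return features, per_step_rewards.sum()

def fit_from_aggregate(feature_list, aggregate_rewards, reg=1.0):
    """Ridge regression of aggregate rewards on concatenated trajectory features."""
    X = np.stack([f.reshape(-1) for f in feature_list])  # shape (K, H*d)
    y = np.asarray(aggregate_rewards)
    gram = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)

rng = np.random.default_rng(0)
horizon, dim, episodes = 3, 4, 2000           # hypothetical toy sizes
theta_true = rng.uniform(0.0, 1.0, size=(horizon, dim))

data = [simulate_episode(rng, theta_true, horizon, dim) for _ in range(episodes)]
theta_hat = fit_from_aggregate([f for f, _ in data], [v for _, v in data], reg=1e-3)
max_error = np.abs(theta_hat.reshape(horizon, dim) - theta_true).max()
```

Because the learner sees one scalar per episode, the regression runs in dimension $H \cdot d$ rather than $d$, which hints at why aggregate feedback is harder than per-step feedback.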

Paper Structure (57 sections, 40 theorems, 149 equations, 4 algorithms)

Introduction
Related Work.
Problem Setup
Markov Decision Process (MDP).
Linear MDP.
Policy and Value.
Aggregate Feedback and Regret.
Algorithms and Main Results
Notation.
Randomized Ensemble Least Squares Value Iteration (RE-LSVI)
Discussion.
Randomized Ensemble Policy Optimization (REPO)
Discussion.
REPO for Tabular MDPs with ABF.
Analysis
...and 42 more sections

Key Result

Theorem 1

Suppose that we run RE-LSVI (Algorithm RE-LSVI) with the parameters defined in the good-event lemma (in the RE-LSVI appendix). Then with probability at least $1 - \delta$, we have

Theorems & Definitions (66)

Theorem 1
Theorem 2
Lemma 3
Proof
Lemma 4: Good event
Lemma 5
Lemma 6: Optimism
Proof
Lemma 7: Cost of optimism
Proof
...and 56 more
