Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning

Gen Li; Yuling Yan; Yuxin Chen; Jianqing Fan

TL;DR

The paper tackles reward-agnostic exploration in finite-horizon MDPs, proposing a two-stage framework that first collects data without reward information and then learns policies for multiple reward functions. By using occupancy distributions as reward-agnostic statistics and integrating a pessimistic, model-based offline RL learner, the authors achieve a minimax-optimal sample complexity of $\tilde{O}(\frac{H^3SA}{\varepsilon^2})$ episodes for polynomially many reward functions, and a reward-free guarantee of $\tilde{O}(\frac{H^3S^2A}{\varepsilon^2})$ episodes that also covers adversarially chosen reward sets. The approach combines offline RL techniques with careful occupancy estimation and Frank-Wolfe-type optimization to select exploration policies, enabling efficient data collection and robust policy learning across reward functions. This work bridges online exploration with offline learning, delivering minimax guarantees and clarifying the role of reward-agnostic quantities in data-efficient reward-agnostic and reward-free exploration in episodic MDPs.

Abstract

This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unaware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*} \frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to log factor)} \end{align*} without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.
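
To make the gap between the two regimes concrete, the following minimal Python sketch evaluates the leading-order terms of the two sample-size bounds stated in the abstract. It drops all constants and logarithmic factors, so the outputs indicate scaling only and are not actual episode counts. The function names are hypothetical and do not come from the paper.

```python
import math


def reward_agnostic_order(S: int, A: int, H: int, eps: float) -> float:
    """Leading-order term S * A * H^3 / eps^2 (constants and log factors dropped)."""
    return S * A * H**3 / eps**2


def reward_free_order(S: int, A: int, H: int, eps: float) -> float:
    """Leading-order term S^2 * A * H^3 / eps^2 (constants and log factors dropped)."""
    return S**2 * A * H**3 / eps**2


if __name__ == "__main__":
    # Illustrative problem sizes, chosen arbitrarily for this sketch.
    S, A, H, eps = 10, 5, 20, 0.1
    agnostic = reward_agnostic_order(S, A, H, eps)
    free = reward_free_order(S, A, H, eps)
    print(f"reward-agnostic order: {agnostic:.3e}")
    print(f"reward-free order:     {free:.3e}")
    print(f"ratio (equals S):      {free / agnostic:.1f}")
```

Under these simplifications the reward-free order exceeds the reward-agnostic one by a factor of $S$, and halving $\varepsilon$ quadruples both.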

Paper Structure (47 sections, 8 theorems, 111 equations, 1 table, 4 algorithms)

Introduction
Reward-agnostic exploration
This paper
Notation
Problem formulation
Basics of Markov decision processes.
Learning processes and goals.
Algorithm
A two-stage algorithm
Initialization.
Stage 1.1: estimating occupancy distributions.
Stage 1.2: computing a behavior policy and drawing samples.
Stage 2: policy learning via offline RL.
Subroutines: approximately solving the subproblems \ref{defi:target-empirical-h} and \ref{defi:target-empirical}
Subroutine for solving the subproblem \ref{defi:target-empirical}.
...and 32 more sections

Key Result

Lemma 1

Armed with the learning rate \eqref{eq:learning-rates-alpha-intuition} and the stopping rule \eqref{eq:termination}, this subroutine terminates within $O(HSA\log (KH))$ iterations.
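
The TL;DR describes the selection of exploration policies as Frank-Wolfe-type optimization. As a rough illustration of that family of methods, here is a generic Frank-Wolfe loop over the probability simplex. It is not the paper's subroutine: the log-type objective, the textbook step size `2 / (t + 2)`, the duality-gap stopping test, and every name in the code are assumptions made for this sketch, and the learning rate and stopping rule referenced in Lemma 1 are not reproduced here.

```python
def frank_wolfe_simplex(D, c=1e-3, tol=1e-6, max_iters=10_000):
    """Generic Frank-Wolfe on the simplex (illustrative, not the paper's method).

    Maximizes the concave function f(w) = sum_j log(c + sum_i w[i] * D[i][j]),
    where row D[i] is a hypothetical occupancy vector of candidate policy i.
    Returns the mixture weights and the number of iterations performed.
    """
    n, m = len(D), len(D[0])
    w = [1.0 / n] * n  # start from the uniform mixture
    iters = 0
    for t in range(max_iters):
        iters = t + 1
        # Mixture occupancy (plus the offset c) for each coordinate j.
        mix = [c + sum(w[i] * D[i][j] for i in range(n)) for j in range(m)]
        # Gradient of f with respect to each mixture weight.
        grad = [sum(D[i][j] / mix[j] for j in range(m)) for i in range(n)]
        # Linear maximization over the simplex picks a single vertex.
        best = max(range(n), key=lambda i: grad[i])
        # Frank-Wolfe duality gap; it upper-bounds the suboptimality of w.
        gap = grad[best] - sum(g * wi for g, wi in zip(grad, w))
        if gap <= tol:
            break
        alpha = 2.0 / (t + 2.0)  # textbook step size, assumed for this sketch
        w = [(1.0 - alpha) * wi for wi in w]
        w[best] += alpha
    return w, iters


if __name__ == "__main__":
    # Two hypothetical policies over two state-action pairs.
    D = [[1.0, 0.0], [0.2, 0.8]]
    weights, iters = frank_wolfe_simplex(D)
    print(weights, iters)
```

For this toy input the objective is maximized when both coordinates of the mixture occupancy are equal, which corresponds to weights close to (0.375, 0.625).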

Theorems & Definitions (10)

Lemma 1
Lemma 2: \cite{kiefer1960equivalence}
Theorem 1: Reward-agnostic RL
Theorem 2: Reward-free RL
Theorem 3
Lemma 3
proof
Lemma 4
proof
Lemma 5
