Improved Bounds for Reward-Agnostic and Reward-Free Exploration

Oran Ridel; Alon Cohen

TL;DR

A new algorithm is proposed that significantly relaxes the requirement on the accuracy parameter $\epsilon$ for reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs). The paper also establishes a tight lower bound for reward-free exploration.

Abstract

We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $\epsilon$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $\epsilon$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for a restrictively small accuracy parameter $\epsilon$. We propose a new algorithm that significantly relaxes the requirement on $\epsilon$. Our approach is novel and of independent technical interest. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $\epsilon$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.
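The last stage described in the abstract (estimate the dynamics from exploration data, then plan once a reward is revealed) can be illustrated with a minimal plug-in sketch. This is not the paper's algorithm: the function names `estimate_dynamics` and `plan`, the tabular indexing `[h][s][a]`, and the uniform fallback for unvisited state-action pairs are all assumptions made for illustration, and the data-gathering exploration policy is taken as given.

```python
def estimate_dynamics(transitions, n_states, n_actions, horizon):
    """Empirical transition model p_hat[h][s][a][s'] from (h, s, a, s') tuples.

    Hypothetical helper: unvisited (h, s, a) triples fall back to a uniform
    distribution, which is an arbitrary illustrative choice.
    """
    counts = [[[[0] * n_states for _ in range(n_actions)]
               for _ in range(n_states)] for _ in range(horizon)]
    for h, s, a, s_next in transitions:
        counts[h][s][a][s_next] += 1

    p_hat = []
    for h in range(horizon):
        layer = []
        for s in range(n_states):
            row = []
            for a in range(n_actions):
                n = sum(counts[h][s][a])
                if n == 0:
                    row.append([1.0 / n_states] * n_states)
                else:
                    row.append([c / n for c in counts[h][s][a]])
            layer.append(row)
        p_hat.append(layer)
    return p_hat


def plan(p_hat, reward, n_states, n_actions, horizon):
    """Backward induction on the estimated model for a revealed reward[h][s][a].

    Returns a greedy policy[h][s] and its value[h][s] under p_hat.
    """
    value = [[0.0] * n_states for _ in range(horizon + 1)]
    policy = [[0] * n_states for _ in range(horizon)]
    for h in reversed(range(horizon)):
        for s in range(n_states):
            best_a, best_q = 0, float("-inf")
            for a in range(n_actions):
                q = reward[h][s][a] + sum(
                    p_hat[h][s][a][s2] * value[h + 1][s2]
                    for s2 in range(n_states)
                )
                if q > best_q:
                    best_a, best_q = a, q
            policy[h][s] = best_a
            value[h][s] = best_q
    return policy, value


# Tiny example: 2 states, 2 actions, horizon 2; reward only in state 1 at step 1.
data = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 0, 0, 0), (1, 1, 0, 1)]
p_hat = estimate_dynamics(data, n_states=2, n_actions=2, horizon=2)
reward = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
policy, value = plan(p_hat, reward, n_states=2, n_actions=2, horizon=2)
```

In this toy instance the planner picks action 1 in state 0 at the first step, since the data show that action leading to the rewarding state. How many samples the exploration phase must collect for such a plug-in policy to be $\epsilon$-optimal is exactly what the paper's bounds address, and the sketch says nothing about it.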

Paper Structure (28 sections, 35 theorems, 101 equations, 2 figures, 1 table, 5 algorithms)

Introduction
Preliminaries
Problem Formulation and Main Results
Main results
Algorithm Overview
Main idea: convex optimization perspective
Exploration policy creation
Analysis and proof sketch
Dynamics and Policy Estimation
Lower bound for reward-free exploration
Conclusion
Omitted details for \ref{sec:policy_set_creation}
Online mirror descent
Exploration Policy Algorithm
Proof of lemmas
...and 13 more sections

Key Result

Theorem 2.1

For any $u \in \Lambda$, and any $r^1,\dots,r^T \in \mathbb{R}^d$, OMD guarantees:
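For readers unfamiliar with online mirror descent (OMD), the following is a minimal sketch of one common instance. It assumes that $\Lambda$ is the probability simplex in $\mathbb{R}^d$, that the mirror map is negative entropy (which gives the exponentiated-gradient, or Hedge, update), and that the step size `eta` is fixed. None of these choices is stated in the excerpt above, and the names `omd_simplex` and `regret` are hypothetical.

```python
import math


def omd_simplex(rewards, eta):
    """Entropic OMD on the simplex for reward maximization.

    rewards: list of T reward vectors r^1..r^T, each of length d.
    Returns the list of iterates x^1..x^T, each played before seeing r^t.
    """
    d = len(rewards[0])
    logits = [0.0] * d          # cumulative eta-scaled rewards
    iterates = []
    for r in rewards:
        # Softmax of the cumulative logits, shifted by the max for stability.
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        iterates.append([wi / z for wi in w])
        # Multiplicative-weights update, written in log space.
        logits = [l + eta * ri for l, ri in zip(logits, r)]
    return iterates


def regret(rewards, iterates, u):
    """Cumulative reward of comparator u minus that of the learner."""
    total = 0.0
    for r, x in zip(rewards, iterates):
        total += sum((ui - xi) * ri for ui, xi, ri in zip(u, x, r))
    return total


# Example: coordinate 0 always pays 1, coordinate 1 always pays 0.
rs = [[1.0, 0.0]] * 50
xs = omd_simplex(rs, eta=0.5)
reg = regret(rs, xs, u=[1.0, 0.0])
```

In this example the iterates start uniform and concentrate on the rewarding coordinate, so the regret against the best fixed point stays small. The precise guarantee of Theorem 2.1 is not reproduced here.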

Figures (2)

Figure 1: Multiple-state MDP construction for the lower bound. Solid lines represent deterministic transitions, and dashed lines represent probabilistic transitions. Blue, red, and green represent classes of deterministic actions (see \ref{def:deterministic_actions}).
Figure 2: Single-state MDP construction for the lower bound. Solid lines represent deterministic transitions, and dashed lines represent probabilistic transitions.

Theorems & Definitions (60)

Theorem 2.1
Proposition 3.1
Proposition 3.2
Lemma 3.3: Informal
Definition 4.1
Theorem 4.2
Lemma 4.3
Lemma 4.4: Best policy cumulative reward lower bound
Lemma 4.5: Learner cumulative reward upper bound
Corollary 4.6
...and 50 more
