Efficient Model-Free Exploration in Low-Rank MDPs

Zakaria Mhammedi; Adam Block; Dylan J. Foster; Alexander Rakhlin


TL;DR

This paper tackles sample-efficient exploration for high-dimensional reinforcement learning in Low-Rank MDPs, where transitions admit a low-rank factorization with unknown embeddings and function approximation is essential. The authors introduce VoX, a model-free, computationally efficient algorithm that builds a layer-wise policy cover by learning a representation and computing a barycentric spanner as an exploration basis, interleaving RepLearn with policy optimization via PSDP and RobustSpanner. VoX achieves a reward-free exploration guarantee with sample complexity polynomial in the feature dimension $d$, the number of actions $A$, the horizon $H$, and the inverse accuracy $1/\varepsilon$, while removing restrictive assumptions such as reachability or non-negativity of embeddings. The approach is modular and practical, leveraging approximate linear optimization oracles to enable scalable spanner computation and a minimax representation-learning objective that yields meaningful guarantees without latent-variable structure.
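To make the low-rank factorization concrete, here is a minimal NumPy sketch of a tabular transition kernel of the form $P(x' \mid x, a) = \langle \phi(x,a), \mu(x') \rangle$ with embedding dimension $d$. The function name `make_low_rank_kernel` and the sizes are hypothetical. The construction draws $\phi(x,a)$ and each row of $\mu$ as probability vectors, which is a latent-variable special case chosen only because it makes the kernel valid by construction; the Low-Rank MDP model itself requires neither latent-variable structure nor non-negative embeddings.

```python
import numpy as np


def make_low_rank_kernel(n_states, n_actions, d, seed=0):
    """Hypothetical tabular kernel P[(x, a), x'] = <phi(x, a), mu(x')>.

    phi(x, a) and each row of mu are drawn as probability vectors, which is a
    latent-variable special case; general Low-Rank MDPs need neither.
    """
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.ones(d), size=n_states * n_actions)  # (|X||A|, d)
    mu = rng.dirichlet(np.ones(n_states), size=d)               # (d, |X|)
    P = phi @ mu                                                # (|X||A|, |X|)
    return phi, mu, P


phi, mu, P = make_low_rank_kernel(n_states=20, n_actions=4, d=3)
print(P.shape)                            # (80, 20)
print(np.allclose(P.sum(axis=1), 1.0))    # True: each row is a distribution
print(np.linalg.matrix_rank(P) <= 3)      # True: rank is at most d
```

Each row of `P` is a next-state distribution for one state-action pair, yet the whole kernel is determined by $d$-dimensional embeddings; in the setting described above, $\phi$ is unknown and must be learned.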

Abstract

A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation are required. Low-Rank Markov Decision Processes -- where transition probabilities admit a low-rank factorization based on an unknown feature embedding -- offer a simple, yet expressive framework for RL with function approximation, but existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions such as latent variable structure, access to model-based function approximation, or reachability. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation and requiring no additional structural assumptions. Our algorithm, VoX, uses the notion of a barycentric spanner for the feature embedding as an efficiently computable basis for exploration, performing efficient barycentric spanner computation by interleaving representation learning and policy optimization. Our analysis -- which is appealingly simple and modular -- carefully combines several techniques, including a new approach to error-tolerant barycentric spanner computation and an improved analysis of a certain minimax representation learning objective found in prior work.
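Barycentric spanner computation can be illustrated with the classical swap-based procedure of Awerbuch and Kleinberg (2008), sketched below with NumPy for a finite set $\mathcal{W} \subset \mathbb{R}^d$ (the rows of `W`) that is assumed to span $\mathbb{R}^d$. This is not VoX's RobustSpanner: here the linear optimization oracle (the hypothetical `linear_oracle`) is exact over an explicit list of vectors, whereas VoX works with approximate oracles realized through policy optimization. The sketch shows the key mechanism: by Cramer's rule, the determinant of the basis with one column swapped out is linear in the new column, so each step needs only two linear optimization calls.

```python
import numpy as np


def linear_oracle(W, theta):
    """Hypothetical exact oracle: return the row of W maximizing <theta, w>."""
    return W[np.argmax(W @ theta)]


def best_swap(B, i, W):
    """Maximize |det(B with column i replaced by w)| over the rows w of W.

    By Cramer's rule, det(B_{i<-w}) = det(B) * (B^{-1} w)_i = <theta, w>,
    which is linear in w, so two oracle calls (theta and -theta) suffice.
    """
    theta = np.linalg.det(B) * np.linalg.inv(B)[i]
    w_pos = linear_oracle(W, theta)
    w_neg = linear_oracle(W, -theta)
    w = w_pos if abs(theta @ w_pos) >= abs(theta @ w_neg) else w_neg
    return w, abs(theta @ w)


def barycentric_spanner(W, C=2.0, max_swaps=10_000):
    """Return a d x d matrix whose columns form a C-approximate barycentric
    spanner for the rows of W (assumed to span R^d), for C > 1."""
    d = W.shape[1]
    B = np.eye(d)
    # Phase 1: replace each identity column with a maximizing element of W.
    for i in range(d):
        w, _ = best_swap(B, i, W)
        B[:, i] = w
    # Phase 2: swap while some replacement grows |det(B)| by more than C.
    for _ in range(max_swaps):
        det_B = abs(np.linalg.det(B))
        for i in range(d):
            w, val = best_swap(B, i, W)
            if val > C * det_B:
                B[:, i] = w
                break
        else:
            return B  # no improving swap: every w has coefficients in [-C, C]
    return B


rng = np.random.default_rng(0)
W = rng.normal(size=(200, 3))
B = barycentric_spanner(W, C=2.0)
beta = np.linalg.solve(B, W.T)  # coefficients of every w in the spanner basis
print(np.abs(beta).max())       # at most C = 2.0 (up to rounding)
```

When no swap improves $|\det(B)|$ by a factor above $C$, every $w$ satisfies $|\beta_i(w)| = |\det(B_{i \leftarrow w})| / |\det(B)| \le C$, which is exactly the spanner property; each swap multiplies $|\det(B)|$ by more than $C > 1$, which is why the loop terminates.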

Paper Structure (50 sections, 29 theorems, 176 equations, 1 table, 5 algorithms)

Introduction
Contributions
Organization
Comparison to previous versions of the paper
Problem Setting
Low-Rank MDP Model
Policies and occupancy measures
Online Reinforcement Learning and Reward-Free Exploration
Function approximation and desiderata
Additional preliminaries
VoX: Algorithm and Main Results
Challenges and Related Work
The VoX Algorithm
Barycentric spanners
Barycentric spanner computation via approximate linear optimization
...and 35 more sections

Key Result

Lemma 1

If $\Psi\subseteq \Pi_{\texttt{M}}$ is a collection of policies such that $\{\mathbb{E}^\pi\left[ \phi^{\star}_{h}(\bm{x}_h, \bm{a}_h) \right]\mid \pi \in \Psi \}\subseteq \mathbb{R}^d$ is a $(C, \varepsilon)$-approximate barycentric spanner for $\mathcal{W}_h\coloneqq \{\mathbb{E}^\pi\left[ \ph
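For intuition in the exact case $\varepsilon = 0$ with $\Psi = \{\pi_1, \dots, \pi_d\}$, the following sketch shows why a barycentric spanner of expected features yields a policy cover. It is an illustration of the mechanism, not the lemma's exact statement, and it assumes (i) the factorization $P_h(x' \mid x, a) = \langle \phi^{\star}_h(x,a), \mu^{\star}_{h+1}(x') \rangle$, (ii) the standard spanner definition, under which every $w \in \mathcal{W}_h$ can be written as $\sum_{i} \beta_i \, \mathbb{E}^{\pi_i}[\phi^{\star}_h(\bm{x}_h, \bm{a}_h)]$ with $|\beta_i| \le C$, and (iii) the notation $d^{\pi}_{h+1}(x)$ for the probability (density) that $\pi$ visits $x$ at layer $h+1$. For any $\pi$, with coefficients $\beta_i = \beta_i(\pi)$:

```latex
\begin{aligned}
d^{\pi}_{h+1}(x)
  &= \mathbb{E}^{\pi}\big[ \langle \phi^{\star}_h(\bm{x}_h, \bm{a}_h), \mu^{\star}_{h+1}(x) \rangle \big]
   = \Big\langle \sum_{i=1}^{d} \beta_i \, \mathbb{E}^{\pi_i}\big[ \phi^{\star}_h(\bm{x}_h, \bm{a}_h) \big], \mu^{\star}_{h+1}(x) \Big\rangle \\
  &= \sum_{i=1}^{d} \beta_i \, d^{\pi_i}_{h+1}(x)
   \le C \sum_{i=1}^{d} d^{\pi_i}_{h+1}(x)
   \le C d \max_{i \in [d]} d^{\pi_i}_{h+1}(x),
\end{aligned}
```

where the first inequality uses $\beta_i \le |\beta_i| \le C$ and $d^{\pi_i}_{h+1}(x) \ge 0$. Hence any state that some policy reaches with probability $\eta$ is reached with probability at least $\eta / (C d)$ by some $\pi_i \in \Psi$.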

Theorems & Definitions (40)

Remark 1: Comparison to previous formulations
Definition 1: Approximate policy cover
Remark 2
Definition 2
Remark 3
Definition 3: awerbuch2008online
Lemma 1
Remark 4: Improved analysis of RepLearn
Theorem 4: Main theorem for VoX
Definition 5: Relative policy cover
...and 30 more
