Representation Learning for Online and Offline RL in Low-rank MDPs

Masatoshi Uehara; Xuezhou Zhang; Wen Sun

TL;DR

This paper tackles representation learning in reinforcement learning for low-rank MDPs whose representation is unknown. It proposes an online algorithm, REP-UCB, and an offline algorithm, REP-LCB, both of which learn sample-efficiently without a prespecified representation. REP-UCB learns a representation φ̂ via MLE and plans with an exploration bonus; it improves the online sample complexity over FLAMBE from $\widetilde{O}(A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ to $\widetilde{O}(A^2 d^4 / (\epsilon^2 (1-\gamma)^{5}))$ while remaining oracle-efficient. REP-LCB provides PAC-style guarantees under partial offline coverage, using pessimism and a relative condition number to measure coverage. The analysis introduces optimism and pessimism techniques that tie the learned representations to fixed potential functions defined on the true features, enabling tractable planning and rigorous guarantees despite nonconvex function approximation. Overall, the work delivers oracle-efficient, provably sample-efficient online and offline methods for representation learning in low-rank MDPs, with practical implications for RL in high-dimensional settings.

Abstract

This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that, on top of the representation, we can perform RL procedures such as exploration and exploitation in a sample-efficient manner? We focus on low-rank Markov Decision Processes (MDPs), where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}(A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}(A^2 d^4 / (\epsilon^2 (1-\gamma)^{5}))$, with $d$ being the rank of the transition matrix (or the dimension of the ground-truth representation), $A$ being the number of actions, and $\gamma$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
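To make the size of the improvement concrete, the two leading terms quoted above can be divided term by term, which gives a ratio of $A^7 d^3 / (\epsilon^8 (1-\gamma)^{17})$. The sketch below is only an illustration of that arithmetic: it drops the logarithmic factors and constants hidden by $\widetilde{O}(\cdot)$, and the function names (`flambe_bound`, `rep_ucb_bound`, `improvement_ratio`) are hypothetical helpers, not anything from the paper.

```python
from fractions import Fraction


def flambe_bound(A, d, eps, gamma):
    """Leading term A^9 d^7 / (eps^10 (1-gamma)^22); logs and constants dropped."""
    return A**9 * d**7 / (eps**10 * (1 - gamma) ** 22)


def rep_ucb_bound(A, d, eps, gamma):
    """Leading term A^2 d^4 / (eps^2 (1-gamma)^5); logs and constants dropped."""
    return A**2 * d**4 / (eps**2 * (1 - gamma) ** 5)


def improvement_ratio(A, d, eps, gamma):
    """Closed-form quotient of the two leading terms: A^7 d^3 / (eps^8 (1-gamma)^17)."""
    return A**7 * d**3 / (eps**8 * (1 - gamma) ** 17)


# Exact rational inputs avoid floating-point round-off when checking the identity.
A, d = 2, 3
eps, gamma = Fraction(1, 2), Fraction(1, 2)
ratio = flambe_bound(A, d, eps, gamma) / rep_ucb_bound(A, d, eps, gamma)
assert ratio == improvement_ratio(A, d, eps, gamma)
```

Because the ratio scales as $\epsilon^{-8}$ and $(1-\gamma)^{-17}$, the gap between the two leading terms widens quickly as the target accuracy tightens or the effective horizon grows. This compares upper bounds only and says nothing about constants or empirical performance.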
