Representation Learning for Online and Offline RL in Low-rank MDPs

Masatoshi Uehara, Xuezhou Zhang, Wen Sun

TL;DR

This paper tackles representation learning in reinforcement learning under unknown low-rank MDPs, proposing the online algorithm Rep-UCB and the offline algorithm Rep-LCB to achieve sample-efficient learning without a prespecified representation. By learning φ̂ via MLE and using bonus-based planning, Rep-UCB improves the online sample complexity from FLAMBE's $\widetilde{O}( A^9 d^7 / (ε^{10} (1-γ)^{22}))$ to $\widetilde{O}( A^2 d^4 / (ε^2 (1-γ)^{5}) )$ while remaining oracle-efficient; Rep-LCB provides PAC-style guarantees under partial offline coverage through pessimism and a relative condition number. The analysis introduces novel optimism/pessimism techniques that tie learned representations to fixed true-feature potential functions, enabling tractable planning and rigorous guarantees despite nonconvex function approximation. Overall, the work delivers the first oracle-efficient, provable online and offline methods for representation learning in low-rank MDPs, with practical implications for sample-efficient RL in high-dimensional settings.
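To make "bonus-based planning" concrete, here is a minimal sketch of a clipped elliptical bonus computed from learned features, of the kind commonly used for linear and low-rank MDPs. This summary does not state the exact form of the bonus, so the function name `elliptical_bonus` and the constants `alpha`, `lam`, and `cap` are illustrative assumptions rather than the paper's definition.

```python
import numpy as np

def elliptical_bonus(phi_query, phi_data, alpha=1.0, lam=1.0, cap=2.0):
    """Clipped elliptical bonus: min(alpha * ||phi||_{Sigma^{-1}}, cap).

    phi_query: (d,) learned feature of the (s, a) pair being scored.
    phi_data:  (n, d) learned features of previously collected (s, a) pairs.
    alpha, lam, cap: hypothetical tuning constants (assumptions).
    """
    d = phi_query.shape[0]
    # Regularized empirical covariance of the learned features.
    sigma = phi_data.T @ phi_data + lam * np.eye(d)
    # Width of the confidence ellipsoid in the direction of phi_query.
    width = np.sqrt(phi_query @ np.linalg.solve(sigma, phi_query))
    return float(min(alpha * width, cap))

if __name__ == "__main__":
    # 50 visits along the first feature direction, none along the second.
    seen = np.tile(np.array([1.0, 0.0]), (50, 1))
    print("visited direction:  ", elliptical_bonus(np.array([1.0, 0.0]), seen))
    print("unvisited direction:", elliptical_bonus(np.array([0.0, 1.0]), seen))
```

The bonus is small in feature directions the data already covers and large in unexplored ones, which is what lets a planner with an optimistic reward drive exploration.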

Abstract

This work studies the question of Representation Learning in RL: how can we learn a compact low-dimensional representation such that on top of the representation we can perform RL procedures such as exploration and exploitation, in a sample-efficient manner? We focus on low-rank Markov Decision Processes (MDPs), where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (ε^{10} (1-γ)^{22}))$ for FLAMBE to $\widetilde{O}( A^2 d^4 / (ε^2 (1-γ)^{5}) )$, with $d$ being the rank of the transition matrix (or dimension of the ground truth representation), $A$ being the number of actions, and $γ$ being the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach and has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.
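To get a feel for the gap between the two rates quoted above, the following sketch evaluates both expressions with constants and logarithmic factors dropped. The $\widetilde{O}$ notation hides those factors, so the numbers only indicate how the bounds scale and are not actual sample counts. The function names are illustrative.

```python
def flambe_rate(A, d, eps, gamma):
    """FLAMBE rate from the abstract: A^9 d^7 / (eps^10 (1-gamma)^22)."""
    return A**9 * d**7 / (eps**10 * (1 - gamma) ** 22)

def rep_ucb_rate(A, d, eps, gamma):
    """REP-UCB rate from the abstract: A^2 d^4 / (eps^2 (1-gamma)^5)."""
    return A**2 * d**4 / (eps**2 * (1 - gamma) ** 5)

def improvement_ratio(A, d, eps, gamma):
    """Ratio of the two rates, i.e. A^7 d^3 / (eps^8 (1-gamma)^17)."""
    return flambe_rate(A, d, eps, gamma) / rep_ucb_rate(A, d, eps, gamma)

if __name__ == "__main__":
    for eps in (0.5, 0.1):
        ratio = improvement_ratio(A=4, d=10, eps=eps, gamma=0.9)
        print(f"eps={eps}: FLAMBE rate / REP-UCB rate = {ratio:.3e}")
```

The ratio simplifies to $A^7 d^3 / (ε^8 (1-γ)^{17})$, so the improvement grows fastest as the target accuracy $ε$ shrinks or the effective horizon $1/(1-γ)$ grows.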

Paper Structure

This paper contains 25 sections, 21 theorems, 117 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 4

Fix $\delta\in(0,1), \epsilon\in(0,1)$. Let $\hat{\pi}$ be a uniform mixture of $\pi_1,\cdots,\pi_N$ and $\pi^{\star}:=\mathop{\mathrm{arg\,max}}\limits_{\pi} V^{\pi}_{P^{\star},r}$ as the optimal policy. By setting parameters as follows: with probability at least $1-\delta$, we have where the number of collected samples is at most where $\nu$ only contains log terms and the dependence on $|\ma

Figures (1)

  • Figure 1: A latent state model captured by low-rank MDP. Here $\phi^\star(s,a)$ is a distribution over a discrete latent state space $\mathcal{Z}$. Note that this is still a Markovian model as there is no transition among latent states.
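The latent state model in Figure 1 can be illustrated with a small tabular sketch: each $(s,a)$ pair draws a latent state $z$ from $\phi^\star(s,a)$, and the next state is then drawn from an emission distribution that depends only on $z$. The emission matrix `mu`, the function name, and the problem sizes below are illustrative assumptions; the check shows that the resulting transition matrix has rank at most $|\mathcal{Z}|$.

```python
import numpy as np

def latent_state_transition(num_states, num_actions, num_latent, seed=0):
    """Build a random tabular transition kernel with a latent-state structure."""
    rng = np.random.default_rng(seed)
    # phi[s, a, :] is a distribution over latent states z.
    phi = rng.dirichlet(np.ones(num_latent), size=(num_states, num_actions))
    # mu[z, :] is the emission distribution over next states s' given z.
    mu = rng.dirichlet(np.ones(num_states), size=num_latent)
    # P[s, a, s'] = sum_z phi[s, a, z] * mu[z, s'].
    P = phi @ mu
    return phi, mu, P

if __name__ == "__main__":
    S, A, Z = 6, 3, 2
    phi, mu, P = latent_state_transition(S, A, Z)
    # Flatten (s, a) into rows to view P as an (S*A) x S matrix.
    rank = np.linalg.matrix_rank(P.reshape(S * A, S))
    print("rows sum to one:", np.allclose(P.sum(axis=-1), 1.0))
    print("rank of transition matrix:", rank, "<= |Z| =", Z)
```

Because the next state depends on $(s,a)$ only through $z$, and $z$ is redrawn at every step rather than evolving on its own, the process over observed states stays Markovian while the transition matrix factorizes through the $|\mathcal{Z}|$-dimensional feature.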

Theorems & Definitions (24)

  • Definition 1: Low-rank MDP
  • Definition 3: Maximum Likelihood Oracle (MLE)
  • Theorem 4: PAC Bound for Rep-UCB
  • Lemma 5: Almost Optimism at the Initial State Distribution
  • Theorem 6: PAC Bound for Rep-LCB
  • Lemma 7: Almost Pessimism at the Initial State Distribution
  • Lemma 8: Almost Optimism at the Initial Distribution
  • Lemma 9: Regret
  • Theorem 10: PAC guarantee of Rep-UCB
  • Lemma 11: Concentration of the bonus term
  • ...and 14 more