Provably Efficient Representation Selection in Low-rank Markov Decision Processes: From Online to Offline RL

Weitong Zhang, Jiafan He, Dongruo Zhou, Amy Zhang, Quanquan Gu

TL;DR

This work addresses sample-efficient reinforcement learning by adaptively selecting representations from a class of valid representations in bilinear low-rank MDPs. It introduces ReLEX, with online (ReLEX-UCB) and offline (ReLEX-LCB) variants, which choose a representation per state-action pair rather than committing to a single universal representation. The online algorithm achieves a problem-dependent constant regret when a positive gap assumption and sufficient coverage hold, while the offline algorithm attains gap-dependent sample complexity and, under aggregate coverage, constant sample complexity. Empirically, ReLEX schemes outperform single-representation baselines in both online and offline settings, highlighting the practical value of representation selection for RL efficiency and robustness in sensor-rich or heterogeneous environments.

Abstract

The success of deep reinforcement learning (DRL) lies in its ability to learn a representation that is well-suited for the exploration and exploitation task. To understand how the choice of representation can improve the efficiency of reinforcement learning (RL), we study representation selection for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline RL. Specifically, we show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection, and achieves a strictly better constant regret if the representation function class has a "coverage" property over the entire state-action space. For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class can cover the state-action space and achieves gap-dependent sample complexity. This is the first result with constant sample complexity for representation learning in offline RL.
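The idea of choosing a representation per state-action pair can be illustrated with a minimal sketch. It rests on a generic fact rather than on the paper's exact update rule: if every representation yields a valid upper (or lower) confidence bound on the same quantity, then the smallest upper bound (or largest lower bound) is also valid and is the tightest available. The function names, the list-of-lists layout, and the numbers below are hypothetical and not taken from the paper.

```python
def _select(estimates, bonuses, sign, pick):
    """Pick, for each state-action pair j, one representation index.

    estimates[i][j], bonuses[i][j]: value estimate and confidence width of
    (hypothetical) representation i at state-action pair j.
    sign=+1 builds upper bounds, sign=-1 builds lower bounds.
    """
    chosen, values = [], []
    for j in range(len(estimates[0])):
        bounds = [est[j] + sign * bon[j] for est, bon in zip(estimates, bonuses)]
        best = pick(range(len(bounds)), key=bounds.__getitem__)
        chosen.append(best)
        values.append(bounds[best])
    return chosen, values


def select_ucb(estimates, bonuses):
    """Online flavour: keep the smallest upper confidence bound per pair."""
    return _select(estimates, bonuses, +1, min)


def select_lcb(estimates, bonuses):
    """Offline flavour: keep the largest lower confidence bound per pair."""
    return _select(estimates, bonuses, -1, max)


# Two hypothetical representations, two state-action pairs.
estimates = [[1.0, 2.0], [1.5, 1.0]]
bonuses = [[0.5, 0.1], [0.25, 0.5]]
ucb_choice, ucb_values = select_ucb(estimates, bonuses)
lcb_choice, lcb_values = select_lcb(estimates, bonuses)
```

In this toy input the chosen representation differs between the two pairs, which is the behaviour a single universal representation cannot provide.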

Paper Structure

This paper contains 37 sections, 20 theorems, 120 equations, 2 figures, 7 tables, 2 algorithms.

Key Result

Theorem 4.5

Under Assumptions asm:gap and asm:hls, set $\beta_{k, \bm{\phi}} = c(C_{\mathbf{M}} + {C'_{\bm{\psi}}}^{2})d_{\bm{\phi}}\log(kHC_{\bm{\phi}} |\Phi| / \delta)$ in Algorithm alg:main, where $c$ is an absolute positive constant. Then, with probability at least $1 - 5\delta$, there exists a threshold independent of the episode number $k$. The regret for the first $k$ episodes is upper bounded by ... where ...
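To make the choice of $\beta_{k, \bm{\phi}}$ concrete, the following sketch evaluates the formula from the theorem statement for given problem constants. The theorem only says that $c$ is an absolute positive constant, so the default `c=1.0` is a placeholder assumption, and the function and argument names are hypothetical.

```python
import math


def confidence_radius(k, H, d_phi, C_M, C_psi_prime, C_phi, num_reps, delta, c=1.0):
    """Evaluate beta_{k,phi} = c (C_M + C'_psi^2) d_phi log(k H C_phi |Phi| / delta).

    k: episode index, H: horizon, d_phi: dimension of representation phi,
    num_reps: |Phi|, delta: failure probability in (0, 1).
    c is unspecified in the theorem; 1.0 is a placeholder, not the paper's value.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    log_term = math.log(k * H * C_phi * num_reps / delta)
    return c * (C_M + C_psi_prime ** 2) * d_phi * log_term


# beta grows only logarithmically with the episode index k.
beta_early = confidence_radius(k=10, H=1, d_phi=2, C_M=1.0, C_psi_prime=1.0,
                               C_phi=1.0, num_reps=1, delta=0.1)
beta_late = confidence_radius(k=10_000, H=1, d_phi=2, C_M=1.0, C_psi_prime=1.0,
                              C_phi=1.0, num_reps=1, delta=0.1)
```

Because $k$ and $|\Phi|$ enter only through the logarithm, enlarging the representation class or running more episodes widens the confidence radius slowly.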

Figures (2)

  • Figure 1: Cumulative regret over 5M episodes for ReLEX-UCB vs. UC-MatrixRL and $\epsilon$-greedy using a single representation.
  • Figure 2: Relative sub-optimality of ReLEX-LCB after 500K offline episodes.

Theorems & Definitions (39)

  • Definition 3.2: Bilinear MDPs, yang2020reinforcement
  • Definition 3.3: Admissible Function Class
  • Remark 3.4
  • Remark 3.5
  • Remark 3.6
  • Remark 4.1
  • Remark 4.3
  • Remark 4.4
  • Theorem 4.5
  • Remark 4.6
  • ...and 29 more