Provably Efficient Representation Selection in Low-rank Markov Decision Processes: From Online to Offline RL

Weitong Zhang; Jiafan He; Dongruo Zhou; Amy Zhang; Quanquan Gu

TL;DR

This work studies sample-efficient reinforcement learning through adaptive selection among a class of valid representations in bilinear low-rank MDPs. It introduces ReLEX, with an online variant (ReLEX-UCB) and an offline variant (ReLEX-LCB), which chooses a representation for each state-action pair rather than committing to a single universal representation. The online algorithm achieves problem-dependent constant regret when the gap is positive and the representation class provides sufficient coverage. The offline algorithm attains gap-dependent sample complexity and, under aggregate coverage, constant sample complexity. Empirically, ReLEX outperforms single-representation baselines in both online and offline settings, highlighting the practical value of representation selection for RL efficiency and robustness in sensor-rich or heterogeneous environments.

Abstract

The success of deep reinforcement learning (DRL) lies in its ability to learn a representation that is well-suited for the exploration and exploitation task. To understand how the choice of representation can improve the efficiency of reinforcement learning (RL), we study representation selection for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline RL. Specifically, we show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection, and achieves a strictly better constant regret if the representation function class has a "coverage" property over the entire state-action space. For the offline counterpart, ReLEX-LCB, we show that the algorithm can find the optimal policy if the representation class can cover the state-action space and achieves gap-dependent sample complexity. This is the first result with constant sample complexity for representation learning in offline RL.
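The idea of choosing a representation per state-action pair can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes each candidate representation supplies a ridge-regression value estimate with an elliptical confidence width, and it keeps the tightest bound (the smallest upper bound for the optimistic online case, the largest lower bound for the pessimistic offline case). The function names, `beta`, and the toy numbers are all hypothetical.

```python
import numpy as np

def elliptical_bonus(phi, cov, beta):
    """Confidence width beta * sqrt(phi^T cov^{-1} phi) for one feature vector."""
    return beta * np.sqrt(phi @ np.linalg.solve(cov, phi))

def select_ucb(features, weights, covs, beta=1.0):
    """Optimistic selection: index and value of the smallest upper bound."""
    ucbs = [phi @ w + elliptical_bonus(phi, cov, beta)
            for phi, w, cov in zip(features, weights, covs)]
    idx = int(np.argmin(ucbs))
    return idx, float(ucbs[idx])

def select_lcb(features, weights, covs, beta=1.0):
    """Pessimistic selection: index and value of the largest lower bound."""
    lcbs = [phi @ w - elliptical_bonus(phi, cov, beta)
            for phi, w, cov in zip(features, weights, covs)]
    idx = int(np.argmax(lcbs))
    return idx, float(lcbs[idx])

# Toy example (assumed data): two representations of the same state-action pair.
# Both predict the value 0.5, but the second has seen more data in the
# direction of its feature vector, so its confidence interval is narrower.
features = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
weights = [np.array([0.5, 0.0]), np.array([0.0, 0.5])]
covs = [np.diag([1.0, 1.0]), np.diag([1.0, 4.0])]

ucb_idx, ucb_val = select_ucb(features, weights, covs)
lcb_idx, lcb_val = select_lcb(features, weights, covs)
print(ucb_idx, ucb_val, lcb_idx, lcb_val)
```

If every candidate bound is valid, the minimum of the upper bounds is still a valid upper bound and the maximum of the lower bounds is still a valid lower bound, which is why keeping the tightest one per state-action pair cannot hurt in this simplified picture.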
