Provably Efficient Representation Selection in Low-rank Markov Decision Processes: From Online to Offline RL
Weitong Zhang, Jiafan He, Dongruo Zhou, Amy Zhang, Quanquan Gu
TL;DR
This work studies sample-efficient reinforcement learning in bilinear low-rank MDPs by adaptively selecting among a class of valid representations. It introduces ReLEX, with an online variant (ReLEX-UCB) and an offline variant (ReLEX-LCB), which chooses a representation for each state-action pair instead of committing to a single universal one. The online algorithm achieves problem-dependent constant regret when the sub-optimality gap is positive and the representation class has sufficient coverage, while the offline algorithm attains gap-dependent sample complexity and, when the representation class collectively covers the state-action space, constant sample complexity. Empirically, ReLEX outperforms single-representation baselines in both the online and offline settings, which highlights the practical value of representation selection for RL efficiency and robustness in sensor-rich or heterogeneous environments.
Abstract
The success of deep reinforcement learning (DRL) lies in its ability to learn a representation that is well-suited for the exploration and exploitation task. To understand how the choice of representation can improve the efficiency of reinforcement learning (RL), we study representation selection for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose an efficient algorithm, called ReLEX, for representation learning in both online and offline RL. Specifically, we show that the online version of ReLEX, called ReLEX-UCB, always performs no worse than the state-of-the-art algorithm without representation selection, and achieves a strictly better constant regret if the representation function class has a "coverage" property over the entire state-action space. For the offline counterpart, ReLEX-LCB, we show that the algorithm achieves gap-dependent sample complexity and can find the optimal policy if the representation class can cover the state-action space. This is the first result with constant sample complexity for representation learning in offline RL.
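The idea of choosing a representation per state-action pair can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate representation keeps its own regularized feature covariance matrix and that uncertainty is measured by the elliptical norm of the feature vector; the selection rule, the function name `select_representation`, and all numbers below are hypothetical and are not taken from the ReLEX specification.

```python
import numpy as np

def select_representation(features, covariances):
    """Pick the candidate representation with the smallest elliptical bonus.

    features:    list of 1-D arrays, features[i] = phi_i(s, a) for one fixed (s, a)
    covariances: list of positive-definite matrices, one per representation
    Both the rule and the names are illustrative assumptions.
    """
    # Bonus for representation i: sqrt(phi_i^T Lambda_i^{-1} phi_i).
    bonuses = [
        float(np.sqrt(phi @ np.linalg.solve(cov, phi)))
        for phi, cov in zip(features, covariances)
    ]
    best = int(np.argmin(bonuses))
    return best, bonuses[best]

# Two hypothetical 2-D representations of the same state-action pair.
phi_a = np.array([1.0, 0.0])
phi_b = np.array([0.0, 1.0])
# Representation A has seen its feature direction often; B has not.
cov_a = np.diag([100.0, 1.0])
cov_b = np.diag([1.0, 1.0])

index, bonus = select_representation([phi_a, phi_b], [cov_a, cov_b])
```

Here representation A is selected because its data already covers the queried direction well, so its uncertainty at this pair is smaller. A different pair could favor representation B, which is the intuition behind selecting per pair rather than fixing one representation globally.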
