Bridging State and History Representations: Understanding Self-Predictive RL
Tianwei Ni, Benjamin Eysenbach, Erfan Seyedsalehi, Michel Ma, Clement Gehring, Aditya Mahajan, Pierre-Luc Bacon
TL;DR
The paper unifies state and history representations in RL under a self-predictive abstraction, connecting several existing objectives (reward prediction, latent-state prediction, and observation prediction) through an implication graph. It offers theoretical insight into why stop-gradient targets help when learning self-predictive representations, and it introduces a minimalist phi_L algorithm that learns these representations end to end without reward modeling or planning. The approach is evaluated on standard MDPs, MDPs with distractors, and sparse-reward POMDPs, where it shows improved sample efficiency and robustness. Overall, the work clarifies when self-predictive and observation-predictive representations are advantageous, gives preliminary guidelines for practitioners, and provides a simple end-to-end baseline for disentangling representation learning from policy optimization.
Abstract
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.
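To make the stop-gradient idea concrete, here is a minimal sketch of a latent self-prediction loss. It is an illustration under simplifying assumptions, not the paper's algorithm: the encoder is a linear map to a scalar latent, the latent transition is a single scalar `p`, and all names (`encode`, `zp_loss_and_grads`) are hypothetical. The loss is 0.5 * (p * phi(s) - phi(s'))^2, and the sketch computes the encoder gradient both with the target phi(s') treated as a constant (stop-gradient) and with gradients also flowing through the target.

```python
def encode(w, s):
    """Hypothetical linear encoder phi(s) = w . s with a scalar latent."""
    return sum(wi * si for wi, si in zip(w, s))


def zp_loss_and_grads(w, p, s, s_next):
    """Latent self-prediction loss 0.5 * (p * phi(s) - phi(s'))**2.

    Returns the loss, the encoder gradient with a stop-gradient target,
    the encoder gradient without stop-gradient, and the gradient for p.
    """
    z = encode(w, s)            # online latent phi(s)
    z_next = encode(w, s_next)  # target latent phi(s')
    err = p * z - z_next
    loss = 0.5 * err ** 2

    # Stop-gradient: phi(s') is treated as a constant, so only the
    # prediction branch p * phi(s) contributes to the encoder gradient.
    grad_w_stop = [err * p * si for si in s]

    # No stop-gradient: the target branch contributes an extra
    # -err * s' term to the encoder gradient.
    grad_w_full = [err * (p * si - sni) for si, sni in zip(s, s_next)]

    grad_p = err * z
    return loss, grad_w_stop, grad_w_full, grad_p


if __name__ == "__main__":
    w, p = [0.5, -1.0], 2.0
    s, s_next = [1.0, 2.0], [0.0, 1.0]
    print(zp_loss_and_grads(w, p, s, s_next))
```

The two encoder gradients differ exactly by the term that passes through the target, which is the term a stop-gradient removes. In a deep learning framework the same effect is usually obtained by detaching the target latent before computing the loss.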
