Bridging State and History Representations: Understanding Self-Predictive RL

Tianwei Ni; Benjamin Eysenbach; Erfan Seyedsalehi; Michel Ma; Clement Gehring; Aditya Mahajan; Pierre-Luc Bacon

TL;DR

The paper unifies state and history representations in RL under a self-predictive abstraction, connecting several existing objectives (reward, latent-state, and observation prediction) through an implication graph. It explains theoretically why stop-gradient targets help in learning self-predictive representations and introduces a minimalist $\phi_L$ algorithm that learns these representations end-to-end, without reward modeling or planning. The approach is validated empirically on standard MDPs, distracting MDPs, and sparse-reward POMDPs, where it shows improved sample efficiency and robustness. Overall, the work clarifies when self-predictive and observation-predictive representations are advantageous, and it offers practitioners preliminary guidelines along with a simple end-to-end baseline for disentangling representation learning from policy optimization.

Abstract

Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.
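The stop-gradient technique mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a linear encoder `phi` and a linear latent model `theta`, and it is not the paper's implementation: the function names, the single-transition loss, and the EMA rule with coefficient `tau` are all hypothetical choices made for illustration. The sketch shows that an online target adds an extra gradient path through the target latent, while a detached or EMA target treats it as a constant.

```python
import numpy as np

def zp_l2_loss(phi, theta, phi_target, h, h_next):
    """0.5 * ||(h @ phi) @ theta - h_next @ phi_target||^2 for one transition."""
    err = (h @ phi) @ theta - h_next @ phi_target
    return 0.5 * float(err @ err)

def zp_l2_grad_phi(phi, theta, phi_target, h, h_next, online=False):
    """Gradient of the loss above with respect to the encoder phi.

    online=False: phi_target is a constant (stop-gradient / detached or EMA target).
    online=True:  the caller passes phi_target = phi, and the gradient also
                  flows through the target latent.
    """
    err = (h @ phi) @ theta - h_next @ phi_target
    grad = np.outer(h, err @ theta.T)      # path through the predicted latent
    if online:
        grad -= np.outer(h_next, err)      # extra path through the target latent
    return grad

def ema_update(phi_target, phi, tau=0.995):
    """Assumed EMA rule: tau=0 copies phi each step, tau=1 freezes the target."""
    return tau * phi_target + (1.0 - tau) * phi
```

In this sketch a detached target corresponds to `ema_update(..., tau=0.0)` applied after every encoder step, so the three target types in the paper's experiments differ only in how `phi_target` is produced and whether the second gradient term is kept.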

Paper Structure (79 sections, 20 theorems, 72 equations, 16 figures, 4 tables, 2 algorithms)

Introduction
Background
A Unified View on State and History Representations
An Implication Graph of Representations in RL
Which Representations Do Prior Methods Learn?
On Learning Self-Predictive Representations in RL
Are Practical ZP Objectives Biased?
Why Do Stop-Gradients Work for ZP Optimization?
A Minimalist RL Algorithm for Learning Self-Predictive Representations
Experiments
State Representation Learning in Standard MDPs
State Representation Learning in Distracting MDPs
History Representation Learning in Sparse-Reward POMDPs
Discussion
Appendix
...and 64 more sections

Key Result

Theorem 1

An encoder satisfying $\phi_O$ also belongs to $\phi_L$; an encoder satisfying $\phi_L$ also belongs to $\phi_{Q^*}$; the reverse is not necessarily true.

Figures (16)

Figure 1: An implication graph showing the relations between the conditions on history representations. The source nodes of the edges with the same color together imply the target node. In MDPs, OR implies all the other conditions. All the connections are discovered in this work, except for (1) OP + Rec implying ZP, (2) ZP + RP implying $\phi_{Q^*}$.
Figure 2: The absolute normalized inner product of the two column vectors in the learned encoder when using an online, detached, or EMA ZP target in an MDP (left) and a POMDP (right). We plot the results for 100 different seeds, which control the rollouts used to sample transitions and the initialization of the representation. The bold lines represent the median over the seeds.
Figure 3: Decoupling representation learning from policy optimization using our algorithm based on ALM(3) (Ghugare et al., 2022). Comparison between $\phi_{Q^*}$ (TD3), $\phi_L$ (our algorithm (ZP-$\ell_2$, ZP-FKL, ZP-RKL) and ALM(3)), $\phi_O$ (OP-$\ell_2$, OP-FKL), in the standard MuJoCo benchmark for 500k steps, averaged over 12 seeds. The observation dimension increases from left figure to right figure ($17,17,111,376$).
Figure 4: Representation collapse with online targets. On four benchmark tasks, we observe that using the online ZP target in $\ell_2$ objectives results in lower returns (top) and low-rank representations (bottom). In line with our theory, using a detached or EMA ZP target mitigates the representational collapse and yields higher returns.
Figure 5: Self-predictive representations are more robust. Comparison between $\phi_{Q^*}$ (TD3), $\phi_L$ (ZP-$\ell_2$, ZP-FKL, ZP-RKL) using our algorithm, $\phi_O$ (OP-$\ell_2$, OP-FKL) in the distracting MuJoCo benchmark, varying the distractor dimension from $2^4$ to $2^8$, averaged over 12 seeds. The y-axis is final performance at 1.5M steps.
...and 11 more figures
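The collapse measures named in the captions of Figures 2 and 4 can be sketched in a few lines of NumPy. The first function computes the absolute normalized inner product of two encoder columns as described for Figure 2: a value of 0 means the columns are orthogonal, and a value of 1 means they are parallel, so the encoder has collapsed to rank one. The second function is only an assumed stand-in for the "low-rank representations" in Figure 4; the relative tolerance `tol` and the SVD-based definition are illustrative choices, since the captions do not state how rank is estimated.

```python
import numpy as np

def abs_normalized_inner_product(encoder):
    """|<u, v>| / (||u|| * ||v||) for the two columns u, v of a (d, 2) encoder."""
    u, v = encoder[:, 0], encoder[:, 1]
    return abs(float(u @ v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def numerical_rank(representations, tol=1e-3):
    """Count singular values above tol times the largest one (assumed criterion)."""
    s = np.linalg.svd(representations, compute_uv=False)  # sorted descending
    return int(np.sum(s > tol * s[0]))
```

For example, an encoder whose second column is a multiple of the first gives a normalized inner product of 1 (up to rounding) and a numerical rank of 1.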

Theorems & Definitions (43)

Theorem 1: Relationships between common abstractions (informal)
Theorem 2: ZP + $\phi_{Q^*}$ imply RP
Proposition 1: The practical $\ell_2$ objective is an upper bound of the ideal objective $\mathcal{L}_{\text{ZP},\ell}(\phi,\theta;h,a)$ that targets the EZP condition. The equality holds in deterministic environments.
Proposition 2: The practical $f$-divergence objective is an upper bound of the ideal objective $\mathcal{L}_{\text{ZP},D_f}(\phi,\theta;h,a)$ that targets the ZP condition. The equality holds in deterministic environments.
Proposition 3: The $\ell_2$ objective with stop gradients ($J_\ell(\phi,\theta,\overline \phi; h,a)$) ensures stationary points that satisfy EZP, but the $\ell_2$ objective with online targets lacks this guarantee.
Theorem 3: Stop-gradient provably avoids representational collapse in linear models
Theorem 4: Granularity of state and history abstractions (the formal version of Theorem 1)
Proposition 4: $\Phi_L$ is equivalent to $\phi_L$
proof
Proposition 5: $\Phi_O$ is equivalent to $\phi_O$
...and 33 more
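
The upper bound in Proposition 1 is consistent with the standard bias-variance identity for squared error. The derivation below is a sketch under assumptions not stated in this summary: the practical $\ell_2$ objective is read as the expected squared error to a sampled next latent, the ideal objective as the squared error to the expected next latent, and the symbols $x$ and $Z$ are introduced here only for illustration.

```latex
% Assumed notation: x = g_\theta(\phi(h), a) is the predicted next latent,
% Z = \phi(h') is the encoded next history with h' \sim P(\cdot \mid h, a),
% and the target Z is held fixed with respect to x.
\begin{align*}
\mathbb{E}\,\|x - Z\|_2^2
  &= \mathbb{E}\,\|(x - \mathbb{E}[Z]) - (Z - \mathbb{E}[Z])\|_2^2 \\
  &= \|x - \mathbb{E}[Z]\|_2^2
     - 2\,(x - \mathbb{E}[Z])^\top\, \mathbb{E}\big[Z - \mathbb{E}[Z]\big]
     + \mathbb{E}\,\|Z - \mathbb{E}[Z]\|_2^2 \\
  &= \|x - \mathbb{E}[Z]\|_2^2 + \mathbb{E}\,\|Z - \mathbb{E}[Z]\|_2^2
  \;\ge\; \|x - \mathbb{E}[Z]\|_2^2 .
\end{align*}
```

The gap is the total variance of $Z$ given $(h, a)$. It vanishes when the next latent is deterministic given $(h, a)$, which matches the statement that equality holds in deterministic environments.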
