
Can We Really Learn One Representation to Optimize All Rewards?

Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach

TL;DR

The paper scrutinizes forward-backward (FB) representation learning as a way to pre-train a single latent representation that enables zero-shot optimization of any reward in reinforcement learning. It shows that ground-truth FB representations exist only under strict rank conditions, and that the FB objective is a temporal-difference (TD) variant of a least-squares importance fitting (LSIF) loss tied to an FB Bellman operator that is not a contraction, which can hinder convergence. To address these issues, the authors propose one-step FB, which fixes the behavioral policy and performs a single step of policy improvement, learning forward and backward representations with a one-step TD LSIF loss and orthonormal regularization. Empirically, one-step FB converges reliably in didactic settings and on continuous control benchmarks, reaching errors up to $10^5$ times smaller and an average gain of about 24% in zero-shot performance across 8 state-based and 2 image-based domains, and it provides a strong initialization for subsequent fine-tuning. The work offers a practical, theoretically grounded unsupervised pre-training method and clarifies the limits of using a single FB representation to optimize all rewards.

Abstract

As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet-to-be-determined reward function. Recent work (forward-backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method one-step forward-backward representation learning (one-step FB). Experiments in didactic settings, as well as in $10$ state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5$ times smaller and improves zero-shot performance by $+24\%$ on average. Our project website is available at https://chongyi-zheng.github.io/onestep-fb.
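To make the "one step of policy improvement" idea concrete, here is a minimal sketch, not the authors' implementation, of how a bilinear factorization of a successor measure could turn a new reward into Q-values and a greedy policy. It assumes the normalized convention $M(s, a, s') \approx F(s, a)^\top B(s')\,\rho(s')$ with $Q(s, a) = F(s, a)^\top z / (1 - \gamma)$ and $z = \sum_{s'} \rho(s')\, r(s')\, B(s')$. All function names and the toy numbers are hypothetical and chosen only for illustration.

```python
def reward_embedding(B, rho, reward):
    """Task vector z = sum_s' rho(s') * r(s') * B(s') (assumed convention)."""
    d = len(B[0])
    z = [0.0] * d
    for b_row, p, r in zip(B, rho, reward):
        for k in range(d):
            z[k] += p * r * b_row[k]
    return z


def q_values(F, z, gamma):
    """Q(s, a) = F(s, a)^T z / (1 - gamma) under the normalized convention."""
    return [
        [sum(f * zk for f, zk in zip(f_sa, z)) / (1.0 - gamma) for f_sa in f_s]
        for f_s in F
    ]


def greedy_policy(Q):
    """One step of policy improvement: pick the argmax action in each state."""
    return [max(range(len(q_s)), key=q_s.__getitem__) for q_s in Q]


# Toy example (made-up numbers): 2 states, 2 actions, latent dimension d = 2.
gamma = 0.9
rho = [0.5, 0.5]                      # marginal measure over states
M = [[[0.9, 0.1], [0.2, 0.8]],        # illustrative successor measure M[s][a][s']
     [[0.7, 0.3], [0.1, 0.9]]]
B = [[1.0, 0.0], [0.0, 1.0]]          # trivial backward representation (one-hot)
# With one-hot B, F(s, a) = M(s, a, .) / rho reproduces M exactly.
F = [[[m / p for m, p in zip(M[s][a], rho)] for a in range(2)] for s in range(2)]

reward = [0.0, 1.0]                   # a new reward, seen only at test time
z = reward_embedding(B, rho, reward)
Q = q_values(F, z, gamma)
policy = greedy_policy(Q)
```

In this sketch the reward enters only through the vector `z`, so changing the reward requires no retraining of `F` or `B`; whether the resulting greedy policy is optimal or only one-step improved depends on which policy the successor measure was computed for.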

Paper Structure (57 sections, 9 theorems, 74 equations, 10 figures, 4 tables, 1 algorithm)


Key Result

Proposition 1

Given any discrete CMP, a finite latent space ${\mathcal{Z}}$, and a marginal measure $\rho$, any FB representation matrices $F^{\star}_{\mathcal{Z}} \in \mathbb{R}^{\lvert{\mathcal{Z}} \times {\mathcal{S}} \times {\mathcal{A}}\rvert \times d}$ and $B^{\star} \in \mathbb{R}^{d \times \lvert{\mathcal

Figures (10)

  • Figure 1: How can we learn a library of policies to quickly maximize new rewards? (Left) Forward-backward representation learning (FB) (touati2021learning) factorizes successor measures into bilinear representations, and uses those representations to acquire new policies. (Right) Our theoretical analysis of this method reveals some optimization challenges, which are alleviated through a simplified method that achieves $24\%$ higher returns in practice.
  • Figure 2: The three-state CMP. Agents start from state $s_0$ and take action $a_i$ ($i = 0, 1, 2$) to deterministically transition into state $s_i$. States $s_1$ and $s_2$ are both absorbing states. Sections \ref{subsec:fb-didactic-exp} and \ref{subsec:onestep-fb-didactic-exp} will use this simple CMP to study the convergence of the FB and the one-step FB algorithms.
  • Figure 3: Learning FB representations in the three-state CMP (Fig. 2). (Left) After training for $10^5$ gradient steps, FB fails to converge to a pair of ground-truth FB representations. (Right) Given a fixed policy, one-step FB exactly fits the ground-truth one-step FB representations within $4 \times 10^4$ gradient steps, suggesting that our method is simpler and more stable. These observations are consistent with our theoretical analysis (Sec. \ref{subsec:fb-convergence}) and the motivation for developing a new method (Sec. \ref{subsec:onestep-fb-repr-obj}).
  • Figure 4: Domains for evaluation. (Top) ExORL domains ($16$ state-based tasks). (Bottom) OGBench domains ($20$ state-based tasks and $10$ image-based tasks).
  • Figure 5: Fine-tuning pre-trained agents on downstream tasks. After offline pre-training, each method is fine-tuned online with the same off-the-shelf RL algorithm (TD3). One-step FB continues to provide higher sample efficiency ($+40\%$ on average) during fine-tuning than the original FB method.
  • ...and 5 more figures
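
The three-state CMP described in Figure 2 is small enough to work through by hand. The following is a minimal sketch, not taken from the paper, that computes the successor measure of a fixed policy in that CMP by fixed-point iteration. It assumes the normalized convention $M^{\pi}(s, a, s') = (1 - \gamma) \sum_{t \ge 0} \gamma^{t} \Pr(s_{t+1} = s' \mid s, a, \pi)$, a discount of $0.9$, and a uniform policy; these choices and all names are illustrative assumptions.

```python
GAMMA = 0.9  # assumed discount factor, not specified in the text above
N = 3        # three states (s0, s1, s2) and three actions (a0, a1, a2)


def next_state(s, a):
    """From s0, action a_i leads to s_i; s1 and s2 are absorbing."""
    return a if s == 0 else s


def successor_measure(policy, gamma=GAMMA, iters=500):
    """Iterate M <- (1 - gamma) * P + gamma * P^pi M for a fixed policy.

    For a fixed policy this update is a gamma-contraction, so the
    iteration converges to the unique fixed point.
    """
    M = [[[0.0] * N for _ in range(N)] for _ in range(N)]  # M[s][a][s']
    for _ in range(iters):
        new_M = [[[0.0] * N for _ in range(N)] for _ in range(N)]
        for s in range(N):
            for a in range(N):
                s1 = next_state(s, a)
                for t in range(N):
                    # Expected successor measure from the next state under the policy.
                    boot = sum(policy[s1][a1] * M[s1][a1][t] for a1 in range(N))
                    hit = 1.0 if t == s1 else 0.0
                    new_M[s][a][t] = (1.0 - gamma) * hit + gamma * boot
        M = new_M
    return M


uniform = [[1.0 / N] * N for _ in range(N)]  # uniform policy (an assumption)
M = successor_measure(uniform)
```

Under these assumptions, taking $a_1$ in $s_0$ puts all of the measure on the absorbing state $s_1$, while taking $a_0$ leaves mass $1/7$ on $s_0$ and $3/7$ on each of $s_1$ and $s_2$. Because the policy is held fixed, the update is a standard policy-evaluation backup, which is the setting in which one-step FB is reported to converge.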

Theorems & Definitions (21)

  • Definition 1: Informal; Definition 1 of touati2021learning
  • Definition 2: Informal; Theorem 2 of touati2021learning
  • Proposition 1: Informal
  • Corollary 1
  • Corollary 2: Informal
  • Proposition 2: Informal
  • Definition 3
  • Remark 1
  • Proposition 3: Informal
  • Lemma 1: Lemma 1.6 and Corollary 1.5 of agarwal2019reinforcement
  • ...and 11 more