Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models

Pranaya Jajoo, Harshit Sikchi, Siddhant Agarwal, Amy Zhang, Scott Niekum, Martha White

Abstract

Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for reward functions that lie in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage to learn task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity and, consequently, reducing span. We propose an approach, Regularized Latent Dynamics Prediction (RLDP), that adds a simple orthogonality regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we empirically show that prior approaches perform poorly in low-coverage scenarios where RLDP still succeeds.
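
The combination described in the abstract, latent next-state prediction plus an orthogonality regularizer, can be sketched roughly as follows. This is an illustrative sketch only, not the paper's exact objective: the function names, the squared-error prediction term, and the particular penalty (mean squared inner product between features of distinct states in a batch) are all assumptions made for the example. The weight `lam` plays the role of the coefficient $\lambda$ mentioned in the figure captions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def prediction_loss(pred_next, target_next):
    """Mean squared L2 distance between predicted and target next-state latents.

    pred_next stands in for g(phi(s), a) and target_next for phi(s');
    both are lists of feature vectors (hypothetical interface).
    """
    n = len(pred_next)
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred, tgt))
        for pred, tgt in zip(pred_next, target_next)
    ) / n


def orthogonality_penalty(features):
    """Mean squared inner product over ordered pairs of distinct feature vectors.

    Assumed form of the regularizer: it is zero when the batch features are
    mutually orthogonal and grows as features become more similar.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += dot(features[i], features[j]) ** 2
    return total / (n * (n - 1))


def rldp_style_loss(pred_next, target_next, features, lam=1.0):
    """Prediction term plus lam times the (assumed) orthogonality penalty."""
    return prediction_loss(pred_next, target_next) + lam * orthogonality_penalty(features)


# Collapsed features are penalized; orthogonal features are not.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0]]
print(orthogonality_penalty(collapsed), orthogonality_penalty(diverse))
```

Without the penalty term, a prediction loss of this kind can be driven down by mapping all states to similar features, which is the collapse the abstract describes; the penalty makes that solution costly.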

Paper Structure

This paper contains 38 sections, 2 theorems, 24 equations, 11 figures, 10 tables, 1 algorithm.

Key Result

Lemma 4.2

Given an MDP $\bar{\mathcal{M}}$, let $\pi$ be any $\mathcal{K}_V$-Lipschitz valued policy, let $M^\pi$ be the successor measure for $\pi$, and let $\bar{M}^\pi$ be the corresponding successor measure on $\bar{\mathcal{M}}$. Then $\mathcal{L}_{RLDP}(\phi, g, w)$ upper bounds the prediction error in successor measure…

Figures (11)

  • Figure 1: Average cosine similarity between state representations sampled uniformly from the training dataset. Feature similarity increases over the course of training; adding our orthogonality regularizer (with $\lambda = 1$) yields more diverse representations. Shaded region shows standard deviation over 4 seeds.
  • Figure 2: RLDP combines latent next-state prediction with an orthogonality regularizer for feature diversity to learn representations for BFMs.
  • Figure 3: Pair-wise comparison of RLDP against prior offline representation learning methods using per-task oracle-normalized performance differences ($\Delta$ = RLDP – Baseline) in the SMPL Humanoid environment. The gray diamond represents the IQM (Interquartile Mean).
  • Figure 4: Pair-wise comparison of RLDP against baseline representation learning methods in the low-coverage D4RL dataset. Each point represents $\Delta = R_{\text{RLDP}} - R_{\text{baseline}}$ for a single $\{\text{task}, \text{seed}\}$ pair. The gray diamond represents the IQM (Interquartile Mean).
  • Figure 5: Evaluating the impact of orthogonality regularization. We ran one-sided Mann–Whitney U tests on the per-seed returns over 4 seeds to compare different values of the orthogonality regularization coefficient. Adding a small coefficient, $\lambda=0.01$, gives a statistically significant improvement over $\lambda=0.0$.
  • ...and 6 more figures
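
The captions above rely on three summary quantities: average cosine similarity between features (Figure 1), the interquartile mean of performance differences (Figures 3 and 4), and a one-sided Mann–Whitney U test on per-seed returns (Figure 5). The following is a minimal pure-Python sketch of each. All function names are hypothetical, the IQM rounds the trim count down when the sample size is not a multiple of four, and the p-value is computed exactly by enumerating relabelings rather than by whatever routine the authors used; none of this is taken from the paper's code.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(features):
    """Average cosine similarity over all unordered pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def iqm(values):
    """Interquartile mean: mean of the values left after trimming 25% per tail."""
    ordered = sorted(values)
    k = int(0.25 * len(ordered))  # trim count per tail, rounded down
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


def u_statistic(x, y):
    """Mann-Whitney U for x: count of pairs with x_i > y_j, ties counted as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def exact_one_sided_p(x, y):
    """Exact P(U >= observed) under the null, enumerating every relabeling."""
    pooled = list(x) + list(y)
    observed = u_statistic(x, y)
    indices = range(len(pooled))
    count = total = 0
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if u_statistic(xs, ys) >= observed:
            count += 1
    return count / total


# Made-up per-seed returns for two settings, 4 seeds each (illustrative only).
with_reg = [5.0, 6.0, 7.0, 8.0]
without_reg = [1.0, 2.0, 3.0, 4.0]
print(u_statistic(with_reg, without_reg), exact_one_sided_p(with_reg, without_reg))
```

With 4 seeds per group and no ties, the smallest one-sided p-value this test can produce is 1/70 (about 0.014), reached only when every return in one group exceeds every return in the other.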

Theorems & Definitions (4)

  • Definition 4.1
  • Lemma 4.2
  • Lemma A.0
  • proof