Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models

Leonardo F. Toso; Davit Shadunts; Yunyang Lu; Nihal Sharma; Donglin Zhan; Nam H. Nguyen; James Anderson

TL;DR

The paper tackles brittle planning under visual distribution shift in image-based world models by adding a bisimulation encoder on top of frozen pretrained visual features. The encoder is trained jointly with the latent transition model using a bisimulation objective, which induces invariant, control-relevant latent dynamics, and a PCA-regularized VICReg term, which prevents representation collapse. The resulting latent space is compact (up to 10× smaller than that of DINO-WM) and supports robust planning with MPC/CEM. The approach remains effective across pretrained backbones (DINOv2, SimDINOv2, iBOT) and requires no reward supervision; a reward-free generalization bound ties planning performance to the on-policy bisimulation distance. Empirically, the method is robust to background changes and moving distractors on PointMaze, outperforming the DINO-WM and DR baselines, which supports the practicality of invariant latent representations for planning from high-dimensional visual observations.
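As a rough illustration of what planning with CEM in a learned latent space can look like, the sketch below optimizes an open-loop action sequence against a latent goal. It is not the paper's planner: the function `cem_plan`, the toy dynamics `toy_step`, the squared-distance cost, and all hyperparameters are assumptions made for this example.

```python
import numpy as np

def cem_plan(w0, w_goal, step, horizon=10, action_dim=2,
             n_samples=256, n_elites=32, n_iters=10, seed=0):
    """Cross-entropy method over action sequences in a latent space.

    `step(w, a)` is a hypothetical batched latent transition model that maps
    latents of shape (n, d) and actions of shape (n, action_dim) to next latents.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        actions = mean + std * noise
        # Roll every candidate forward in latent space.
        w = np.repeat(w0[None, :], n_samples, axis=0)
        for t in range(horizon):
            w = step(w, actions[:, t])
        # Cost: squared distance between the final latent and the goal latent.
        cost = np.sum((w - w_goal) ** 2, axis=1)
        # Refit the sampling distribution to the lowest-cost candidates.
        elites = actions[np.argsort(cost)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

def toy_step(w, a):
    # Toy stand-in for a learned transition model (assumption for the demo).
    return w + 0.1 * a

w0 = np.zeros(2)
w_goal = np.array([0.5, -0.3])
plan = cem_plan(w0, w_goal, toy_step)

# Roll out the planned mean sequence and measure the remaining distance to the goal.
w_final = w0[None, :]
for a in plan:
    w_final = toy_step(w_final, a[None, :])
final_dist = float(np.linalg.norm(w_final[0] - w_goal))
```

In an MPC loop, only the first action of `plan` would typically be executed before re-encoding the new observation and replanning.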

Abstract

World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), display a degradation in test-time robustness due to their sensitivity to "slow features". These include visual variations, such as background changes and distractors, that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
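To make the idea of "mapping states with similar transition dynamics to nearby latent states" concrete, here is a minimal sketch of a reward-free, bisimulation-style loss. It is not the paper's objective. It assumes deterministic predicted next latents (so the transition-distance term reduces to a distance between predicted next latents), a discount `gamma`, and random pairing within a batch; the name `bisim_loss` is hypothetical.

```python
import numpy as np

def bisim_loss(w, w_next, gamma=0.9, seed=0):
    """Illustrative reward-free bisimulation-style loss on a batch.

    w:      (n, d) current bisimulation latents.
    w_next: (n, d) predicted next latents, treated as fixed targets.
    The loss pulls the distance between paired current latents toward the
    discounted distance between their predicted next latents.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(w))  # pair each sample with another one
    d_now = np.linalg.norm(w - w[perm], axis=1)
    d_next = gamma * np.linalg.norm(w_next - w_next[perm], axis=1)
    return float(np.mean((d_now - d_next) ** 2))

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 4))

# Next latents whose discounted pairwise distances match the current ones.
loss_matched = bisim_loss(w, w / 0.9, gamma=0.9)
# Next latents that all coincide, so the target distances are zero.
loss_mismatched = bisim_loss(w, np.zeros_like(w), gamma=0.9)
```

Note that an objective of this form, on its own, is minimized by mapping every input to the same point, which is one way to see why an anti-collapse regularizer such as the VICReg term mentioned in the TL;DR is needed alongside it.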

Paper Structure (34 sections, 2 theorems, 24 equations, 12 figures, 4 tables)

Introduction
Related Work
Preliminaries
Joint-Embedding Predictive Architectures
Bisimulation Metric
Learning Invariant Representations
Our World Model
Pretrained Visual Encoders
Bisimulation Encoder
PCA-based Variance Covariance Regularization
Transition Model
Planning
A Generalization Bound
Experiments
A Simple Navigation Task
...and 19 more sections

Key Result

Theorem 4.1

Suppose $h_\eta(\cdot)$ $\varepsilon$-approximately preserves the reward-free bisimulation metric in the sense that In addition, suppose that $h_\eta(\cdot)$ is uniformly bounded as $\left\|h_\eta(z)\right\|_2\leq {H}_w$ for all $z$ and thus $\left\|w_g\right\|_2\leq {H}_w$. Then for any two pretrained visual latent embeddings $z,z'$,

Figures (12)

Figure 1: Visually distinct observations that differ only in background (checkerboard and gradient) are first mapped to latent embeddings $Z$ and $Z'$ by a pretrained encoder (initial state in green, goal in red). A bisimulation encoder then projects these into lower-dimensional representations $W$ and $W'$, which are equivalent under on-policy transition dynamics. In contrast, an observation with different underlying dynamics is mapped to $\tilde{W}$ and separated in the bisimulation space.
Figure 2: Left. Model architecture and training objectives. Visual observations are encoded using a frozen pretrained visual encoder, followed by a bisimulation encoder that maps features into a low-dimensional, control-relevant latent space. The bisimulation loss is trained jointly with the latent transition model, enforcing invariance to task-irrelevant visual features. Right. Rollouts for PointMaze navigation under a background change at test time. For clarity, the background change is not depicted in the figure. See the section on background changes for details. While DINO-WM fails to reach the goal due to the background change, our model succeeds.
Figure 3: First principal component (PC1) of latent embeddings produced by different visual feature encoders for a PointMaze observation. In all cases, PC1 captures a large percentage of the total variance and predominantly encodes background and layout information rather than control-relevant features, motivating our PCA-based VICReg in the bisimulation encoder.
Figure 4: The six different scenarios that we use to measure robustness based on PointMaze with checkerboard background. From left to right, these are: NC: No Change, SC: Slight background Change, C: Color gradient background, LC: Large Color background Change, LCG: Large Color Gradient background change, and D: moving Distractors with yellow and magenta dots.
Figure 5: Overview of the DINO-Bisim architecture.
...and 7 more figures
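The Figure 3 caption refers to the share of total variance captured by the first principal component (PC1) of the encoder's embeddings. A minimal sketch of how that share can be computed is given below, assuming the embeddings are stacked as rows of a matrix; the helper name `pc1_explained_variance` and the synthetic data are assumptions for illustration and say nothing about how the paper's PCA-based VICReg is implemented.

```python
import numpy as np

def pc1_explained_variance(Z):
    """Return (fraction of variance on PC1, PC1 direction) for embeddings Z.

    Z has shape (n_embeddings, dim); rows could be patch embeddings from a
    pretrained encoder (assumption for this example).
    """
    Zc = Z - Z.mean(axis=0, keepdims=True)      # center the embeddings
    _, s, vt = np.linalg.svd(Zc, full_matrices=False)
    var = s ** 2                                # proportional to per-component variance
    return float(var[0] / var.sum()), vt[0]

# Synthetic embeddings dominated by one shared direction plus small noise.
rng = np.random.default_rng(0)
u = np.zeros(16)
u[0] = 1.0
Z = 10.0 * rng.standard_normal((200, 1)) * u + 0.1 * rng.standard_normal((200, 16))

ratio, pc1 = pc1_explained_variance(Z)
alignment = float(abs(pc1 @ u))
```

Here `ratio` is close to 1 because a single direction dominates the synthetic data, mirroring the situation the caption describes for background and layout information.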

Theorems & Definitions (4)

Remark 3.1: Slow Features
Definition 3.2: On-Policy Bisimulation Metric (Castro, 2020)
Theorem 4.1: Reward-free planning generalization bound
Lemma 1.1: Kantorovich-Rubinstein duality (Villani, 2008)
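
For reference, the standard form of the Kantorovich-Rubinstein duality is given below; the paper's own statement of Lemma 1.1 is not reproduced on this page and may differ in notation. Here $\mu$ and $\nu$ are probability measures with finite first moments on a metric space $(\mathcal{X}, d)$, $W_1$ is the 1-Wasserstein distance, and the supremum runs over 1-Lipschitz functions $f$ (these symbols are assumptions introduced for this statement):

$$
W_1(\mu, \nu) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \leq 1} \left( \mathbb{E}_{x \sim \mu}\left[f(x)\right] - \mathbb{E}_{y \sim \nu}\left[f(y)\right] \right), \qquad \|f\|_{\mathrm{Lip}} = \sup_{x \neq y} \frac{|f(x) - f(y)|}{d(x, y)}.
$$

The duality is commonly used in bisimulation-based arguments because it turns a distance between transition distributions into a bound on differences of expectations of Lipschitz functions.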
