Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models
Leonardo F. Toso, Davit Shadunts, Yunyang Lu, Nihal Sharma, Donglin Zhan, Nam H. Nguyen, James Anderson
TL;DR
The paper tackles brittle planning under visual distribution shift in image-based world models by adding a bisimulation encoder on top of frozen pretrained visual features. The encoder is trained jointly with a bisimulation objective, which induces invariant, control-relevant latent dynamics, and a PCA-regularized VICReg term, which prevents representation collapse. The resulting latent space is compact (about 10× smaller than that of DINO-WM) and supports robust planning with MPC/CEM. The approach remains effective across different pretrained backbones (DINOv2, SimDINOv2, iBOT) and requires no reward supervision. A reward-free generalization bound ties planning performance to the on-policy bisimulation distance. Empirically, the method is robust to background changes and moving distractors on PointMaze, outperforming DINO-WM and DR baselines, which supports the practicality of invariant latent representations for planning in high-dimensional vision tasks.
Abstract
World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent joint-embedding predictive architectures (JEPAs), including the DINO world model (DINO-WM), lose robustness at test time because they are sensitive to "slow features": visual variations, such as background changes and distractors, that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
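To make the idea of "mapping states with similar transition dynamics to nearby latent states" concrete, the following is a minimal NumPy sketch of one common reward-free bisimulation-style loss. It is an illustration under stated assumptions, not necessarily the objective used in the paper: the function name `bisim_loss`, the L1 metric, the discount `gamma`, and the squared-error form are all hypothetical choices. The loss asks that the distance between two current latents match the discounted distance between their predicted next latents.

```python
import numpy as np

def bisim_loss(z, z_next, gamma=0.9):
    """Illustrative reward-free bisimulation-style loss (assumed form).

    z:      (B, D) latents of current observations
    z_next: (B, D) predicted next latents under the same actions
    gamma:  discount applied to the next-step distance (assumption)
    """
    # Pairwise L1 distances between all current latents, shape (B, B).
    d_now = np.abs(z[:, None, :] - z[None, :, :]).sum(axis=-1)
    # Pairwise L1 distances between all predicted next latents, shape (B, B).
    d_next = np.abs(z_next[:, None, :] - z_next[None, :, :]).sum(axis=-1)
    # Target distance; in a real training loop this would typically be
    # treated as a constant (stop-gradient) when updating the encoder.
    target = gamma * d_next
    # Mean squared mismatch over all pairs in the batch.
    return float(np.mean((d_now - target) ** 2))

# Toy batch: two 1-D latents whose next-step distance is twice as large.
z = np.array([[0.0], [1.0]])
z_next = np.array([[0.0], [2.0]])
loss_matched = bisim_loss(z, z_next, gamma=0.5)     # distances agree
loss_mismatched = bisim_loss(z, z_next, gamma=1.0)  # distances disagree
```

In this toy batch the current distance is 1 and the next-step distance is 2, so `gamma=0.5` satisfies the fixed-point condition exactly while `gamma=1.0` leaves a mismatch. A collapse-prevention term (the TL;DR mentions a VICReg-style regularizer) would be needed alongside such a loss, since mapping every state to the same point trivially drives it to zero.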
