VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen

TL;DR

VLA-JEPA targets a core weakness of latent-action pretraining: it replaces pixel reconstruction with leakage-free latent state prediction, so that future frames serve only as supervision targets and never as model input. By combining a JEPA-style objective with a latent world model, it learns action-relevant state transitions that are robust to camera motion and background changes. The approach unifies pretraining on human and robot videos into a single pipeline and demonstrates strong generalization and robustness across LIBERO, SimplerEnv, LIBERO-Plus, and real-world Franka tasks. The results indicate practical gains in transferability and stability, with a simpler training recipe than prior multi-stage latent-action frameworks.

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation. Future information is used solely as a supervision target, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe (JEPA pretraining followed by action-head fine-tuning) without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
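
The leakage-free prediction idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear "encoders", the squared-distance loss, and every function and parameter name below are assumptions chosen for illustration. The only property it is meant to show is that the student pathway consumes the current observation alone, while the future frame enters only through the target encoder.

```python
import math


def linear(weights, x):
    """Apply a dense layer (given as a list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def l2_normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def predict_future_latent(current_obs, student_w, predictor_w):
    """Student pathway (hypothetical): sees ONLY the current observation."""
    z_now = linear(student_w, current_obs)
    return l2_normalize(linear(predictor_w, z_now))


def encode_target(future_obs, target_w):
    """Target pathway (hypothetical): encodes the future frame.

    In a real training loop this output would be treated as a constant
    (no gradient flows through it); plain Python has no autograd, so that
    is only noted here, not enforced.
    """
    return l2_normalize(linear(target_w, future_obs))


def alignment_loss(z_pred, z_target):
    """Squared distance in latent space (an assumed choice of distance)."""
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    current, future = [3.0, 4.0], [4.0, 3.0]
    z_pred = predict_future_latent(current, identity, identity)
    z_target = encode_target(future, identity)
    print(alignment_loss(z_pred, z_target))
```

Because `predict_future_latent` has no `future_obs` argument, the prediction cannot depend on the future frame; the future frame influences training only through the loss target.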

Paper Structure (19 sections, 9 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: VLA-JEPA model architecture
  • Figure 2: VLA-JEPA supports cross-domain training on both human videos and robot data. Human videos are trained with an alignment loss under the latent world modeling objective, while robot data are trained with a joint objective consisting of an alignment loss and a robot action prediction loss.
  • Figure 3: Experimental setup on LIBERO, LIBERO-Plus, SimplerEnv, and a real-world Franka robot. We evaluate VLA-JEPA on 3 simulation benchmarks and 1 real-world environment.
  • Figure 4: Real-world experimental results
  • Figure 5: Effect of the proportion of human video data in pre-training on success rates across different perturbation dimensions on the LIBERO-Plus benchmark.
  • ...and 2 more figures
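
The cross-domain objective described in the Figure 2 caption can be sketched as a per-sample loss selection. This is a minimal illustration under stated assumptions, not the paper's code: the `Sample` container, the `total_loss` function, and the `action_weight` parameter are hypothetical names, and the caption does not say how the two robot-data terms are weighted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One training sample with precomputed loss terms (hypothetical layout)."""
    domain: str                          # "human" or "robot"
    alignment_loss: float                # latent world-model alignment term
    action_loss: Optional[float] = None  # only available for robot data


def total_loss(sample: Sample, action_weight: float = 1.0) -> float:
    """Combine loss terms by domain, following the Figure 2 caption.

    Human videos carry no action labels, so only the alignment term is used.
    Robot data add an action prediction term; `action_weight` is an assumed
    weighting that the caption does not specify.
    """
    if sample.domain == "human":
        return sample.alignment_loss
    if sample.domain == "robot":
        if sample.action_loss is None:
            raise ValueError("robot samples need an action_loss")
        return sample.alignment_loss + action_weight * sample.action_loss
    raise ValueError(f"unknown domain: {sample.domain!r}")


if __name__ == "__main__":
    print(total_loss(Sample("human", 0.5)))
    print(total_loss(Sample("robot", 0.5, 0.25)))
```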