Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Jisoo Kim, Jungbin Cho, Sanghyeok Chu, Ananya Bal, Jinhyung Kim, Gunhee Lee, Sihaeng Lee, Seung Hwan Kim, Bohyung Han, Hyunmin Lee, Laszlo A. Jeni, Seungryong Kim

TL;DR

Pri4R is a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. The paper also shows that 3D point track prediction is an effective supervision target for learning action-world dynamics.

Abstract

Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
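
The training setup described above can be read as a joint objective. The following is one plausible way to write it and is an assumption, not the paper's stated formulation: the weight $\lambda$, the choice of an $\ell_1$ penalty, and the symbols $\mathcal{L}_{\text{action}}$, $\mathcal{L}_{\text{track}}$, $N$ (number of tracked points), and $\Delta p^{(i)}_{t+h}$ (displacement of point $i$ at future step $h$) are introduced here for illustration only.

$$
\mathcal{L}_{\text{train}} = \mathcal{L}_{\text{action}} + \lambda\,\mathcal{L}_{\text{track}},
\qquad
\mathcal{L}_{\text{track}} = \frac{1}{HN}\sum_{h=1}^{H}\sum_{i=1}^{N}
\left\lVert \widehat{\Delta p}^{(i)}_{t+h} - \Delta p^{(i)}_{t+h} \right\rVert_1
$$

Under this reading, only $\mathcal{L}_{\text{action}}$ has a counterpart at inference: the point track head and its privileged 3D targets are used during training and then dropped, which is consistent with the claim that Pri4R adds no extra inputs, outputs, or compute at test time.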

Paper Structure (32 sections, 5 equations, 9 figures, 15 tables)

Figures (9)

  • Figure 2: Overview of Pri4R. We augment two common VLA architectures with an auxiliary point track head that predicts per-step 3D point displacements $\widehat{\Delta P}_{t:t+H}$ from backbone embeddings $\mathbf{z}_t$ and the current point set $P_t$. (a) For backbone-centric VLAs (e.g., OpenVLA-OFT [kim2024openvla]), we set $\mathbf{z}_t$ to the final layer action-query token embeddings. (b) For expert-style VLAs (e.g., $\pi_0$ [intelligence2025pi_]), we condition an embedding module on the backbone’s final layer hidden states to produce $\mathbf{z}_t$. (c) The point track head encodes $P_t$ with a PointMLP, then fuses the resulting point features with $\mathbf{z}_t$ via a FusionMLP to predict future point tracks. Privileged 3D point track supervision during training forces the VLA to model how scene geometry evolves, yielding more reliable interaction and higher task success, while leaving the test-time interface and compute completely unchanged.
  • Figure 3: Training dynamics. Pri4R learns slowly early in training because of the added 3D point track objective, but then improves rapidly, reaching the baseline's peak performance $2.7\times$ faster.
  • Figure 4: Predicted point tracks in simulation and the real world. Future trajectories are visualized in a rainbow color map. As highlighted in the red boxes, Pri4R accurately predicts point tracks for both scene elements and the robot.
  • Figure 5: Real-world setup. Blue boxes indicate the target. In "Pick the farthest object", the red box marks a closer distractor object; in "Pick up the doll and place in the white bin", the red box marks a distractor.
  • Figure 6: Qualitative comparison across tasks. Baseline failures (left) vs. successes of our method (right).
  • ...and 4 more figures
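
The point track head described in the Figure 2 caption (a PointMLP that encodes $P_t$, followed by a FusionMLP that combines point features with $\mathbf{z}_t$) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation: the layer widths, the concatenation-based fusion, the pooled single-vector form of `z_t`, the L1 loss, and all function names (`init_mlp`, `mlp`, `point_track_head`, `track_loss`) are assumptions made for the example.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Create (weight, bias) pairs for a small MLP with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply the MLP with ReLU between layers and no activation on the last."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def point_track_head(point_mlp, fusion_mlp, z_t, points_t, horizon):
    """Predict per-step 3D displacements for every point.

    z_t:      (D,)   backbone embedding (assumed pooled to one vector here)
    points_t: (N, 3) current 3D point set P_t
    returns:  (H, N, 3) predicted displacements over the horizon
    """
    point_feat = mlp(point_mlp, points_t)                      # (N, F)
    # Assumed fusion: tile z_t per point and concatenate with point features.
    z_tiled = np.broadcast_to(z_t, (points_t.shape[0], z_t.shape[0]))
    fused = np.concatenate([point_feat, z_tiled], axis=-1)     # (N, F + D)
    out = mlp(fusion_mlp, fused)                               # (N, H * 3)
    return out.reshape(-1, horizon, 3).transpose(1, 0, 2)      # (H, N, 3)

def track_loss(pred, target):
    """Mean absolute error between predicted and target displacements (assumed L1)."""
    return float(np.mean(np.abs(pred - target)))

# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
D, F, N, H = 16, 32, 64, 8
point_mlp = init_mlp(rng, [3, 32, F])
fusion_mlp = init_mlp(rng, [F + D, 64, H * 3])

z_t = rng.normal(size=D)             # stand-in for VLA backbone features
points_t = rng.normal(size=(N, 3))   # stand-in for the current point set
pred = point_track_head(point_mlp, fusion_mlp, z_t, points_t, H)

target = np.zeros_like(pred)         # placeholder for privileged 3D track labels
loss = track_loss(pred, target)
```

In this sketch the head only consumes `z_t` and `points_t` and only produces a training loss, so removing it leaves the action path untouched, matching the caption's statement that the test-time interface and compute are unchanged.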