Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA

Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, Ruqi Huang

TL;DR

This work tackles two core weaknesses of Latent Action Models in Vision-Language-Action systems: weak spatial reasoning and limited long-horizon temporal perception. It introduces Farsighted-LAM, which uses geometry-aware spatial encoding from DINOv2 features and multi-frame temporal modeling to forecast latent actions, and SSM-VLA, a cascaded VLA framework incorporating VisualCoT and a diffusion-based action generator with a Multi-modal Synergistic Attention mechanism. The approach achieves state-of-the-art performance on CALVIN ABC-D and demonstrates robust real-world deployment with cross-embodiment generalization, supported by extensive ablations confirming the value of geometric priors, temporal coherence, and explicit reasoning. Collectively, the method advances robustness, interpretability, and generalizability for embodied agents operating across diverse environments and hardware.

Abstract

Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. Yet, we identify two bottlenecks of LAMs: 1) the commonly adopted end-to-end trained image encoder suffers from poor spatial understanding; 2) LAMs can be fragile when input frames are distant, leading to limited temporal perception. Such factors inevitably hinder stable and clear action modeling. To this end, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling, capturing structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to explicitly reason about environmental dynamics, enhancing decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, and achieve state-of-the-art performance. Our results demonstrate that our strategy of combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in enhancing the robustness and generalizability of embodied intelligence.

Paper Structure

This paper contains 26 sections, 14 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of the end-to-end causal reasoning pipeline within a single SSM-VLA model, encompassing three core stages: 1) Future observation prediction, generating a visual chain-of-thought to enable interpretable and temporally coherent reasoning; 2) Farsighted latent action modeling, integrating spatial and temporal dynamics for effective long-horizon policy planning; 3) Modular action chunk prediction, supporting cross-platform generalization across diverse robotic embodiments. Experiments on both real-world robotic platforms and simulated environments demonstrate SSM-VLA’s robustness and practical effectiveness.
  • Figure 2: Architecture of our Farsighted Latent Action Model. The encoder takes DINOv2 features of the current frame $s_t$ and multiple future keyframes to predict a sequence of latent actions. The decoder then uses the current frame $s_t$ and a quantized latent action $z_{t+k}$ to reconstruct the corresponding future frame $\hat{s}_{t+k}$.
  • Figure 3: The Three-Stage Cascaded VLA Policy. Stage 1 predicts the immediate future observation $\hat{s}_{t+k}$. Stage 2 infers a long-horizon latent action plan $\{\hat{z}_{t+k}\}_{k=1}^{N}$. Stage 3 fuses all information to produce the final executable action $a_t$.
  • Figure 4: Visualization of simulation evaluation tasks. We visualize the simulation results of three different tasks, which demonstrate our model's success in multi-task learning.
  • Figure 5: Visualization of the real-world experiments. The model is asked to place the pink ball into the box. We show two samples with different layouts and cluttered backgrounds.
  • ...and 1 more figures
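
The Figure 2 caption mentions a quantized latent action $z_{t+k}$ but does not say how the quantization is done. As a rough illustration only, the sketch below assumes a VQ-VAE-style nearest-neighbour codebook lookup, which is a common choice in latent action models and may differ from what the paper uses. The function name `quantize_latent_actions`, the array shapes, and the toy codebook are all hypothetical and not taken from the paper.

```python
import numpy as np


def quantize_latent_actions(z_cont, codebook):
    """Snap each continuous latent action to its nearest codebook vector.

    ASSUMPTION: nearest-neighbour (VQ-VAE-style) quantization; the paper's
    actual quantizer is not described in the captions above.

    z_cont:   (N, D) continuous latent actions, one per future keyframe
    codebook: (K, D) code vectors (learned in practice, fixed here)
    returns:  (idx, z_q) with idx of shape (N,) and z_q of shape (N, D)
    """
    # Pairwise squared Euclidean distances via broadcasting: shape (N, K)
    diff = z_cont[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Index of the closest code for every latent action
    idx = np.argmin(dist, axis=1)
    return idx, codebook[idx]


# Toy codebook (hypothetical): code k is the 4-d vector [k, k, k, k]
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))
# Three slightly perturbed latents, standing in for encoder outputs
z_cont = codebook[[2, 5, 5]] + 0.1
idx, z_q = quantize_latent_actions(z_cont, codebook)
print(idx, z_q.shape)
```

In a trained model the codebook would be learned jointly with the encoder and decoder, and the decoder would receive `z_q` together with the current frame $s_t$ to reconstruct $\hat{s}_{t+k}$. Training details such as a straight-through gradient or commitment loss are omitted here because the source does not describe them.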