Seeing Space and Motion: Enhancing Latent Actions with Spatial and Dynamic Awareness for VLA
Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, Ruqi Huang
TL;DR
This work tackles two core weaknesses of Latent Action Models in Vision-Language-Action systems: weak spatial reasoning and limited long-horizon temporal perception. It introduces Farsighted-LAM, which uses geometry-aware spatial encoding from DINOv2 features and multi-frame temporal modeling to forecast latent actions, and SSM-VLA, a cascaded VLA framework incorporating VisualCoT and a diffusion-based action generator with a Multi-modal Synergistic Attention mechanism. The approach achieves state-of-the-art performance on CALVIN ABC-D and demonstrates robust real-world deployment with cross-embodiment generalization, supported by extensive ablations confirming the value of geometric priors, temporal coherence, and explicit reasoning. Collectively, the method advances robustness, interpretability, and generalizability for embodied agents operating across diverse environments and hardware.
Abstract
Latent Action Models (LAMs) enable Vision-Language-Action (VLA) systems to learn semantic action representations from large-scale unannotated data. However, we identify two bottlenecks in existing LAMs: (1) the commonly adopted end-to-end trained image encoder has poor spatial understanding; (2) LAMs can become fragile when the input frames are far apart in time, which limits their temporal perception. Both factors hinder stable, unambiguous action modeling. To address them, we propose Farsighted-LAM, a latent action framework with geometry-aware spatial encoding and multi-scale temporal modeling that captures structural priors and dynamic motion patterns from consecutive frames. We further propose SSM-VLA, an end-to-end VLA framework built upon Farsighted-LAM, which integrates structured perception with a visual Chain-of-Thought module to reason explicitly about environmental dynamics, improving decision consistency and interpretability. We validate SSM-VLA on multiple VLA tasks in both simulation and real-world settings, where it achieves state-of-the-art performance. These results indicate that combining geometry-aware modeling, temporal coherence, and explicit reasoning is effective in improving the robustness and generalizability of embodied intelligence.
