LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

Yuechen Luo; Fang Li; Shaoqing Xu; Yang Ji; Zehan Zhang; Bing Wang; Yuannan Shen; Jianwei Cui; Long Chen; Guang Chen; Hangjun Ye; Zhi-Xin Yang; Fuxi Wen

TL;DR

LaST-VLA (Latent Spatio-Temporal VLA) is a framework that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded Latent Spatio-Temporal CoT, and it excels at spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.

Abstract

While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in a continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. The model is trained with a progressive supervised fine-tuning (SFT) strategy that transitions from feature alignment to trajectory generation, and is then refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.
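
The GRPO refinement mentioned in the abstract relies on scoring several sampled outputs for the same input and normalizing each reward against its own group. The following is a minimal sketch of that group-relative normalization only, not the paper's training code. The function name, the `eps` constant, and the example rewards are hypothetical, and the use of a PDMS-like scalar score as the reward is an assumption made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sample's reward against the mean and std of its group.

    `rewards` holds one scalar score per trajectory sampled for the same
    scene (assumed here to be a PDMS-like score; the actual reward design
    is not specified in this section).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division defined when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four trajectories sampled for one scene.
advantages = group_relative_advantages([0.90, 0.80, 0.95, 0.60])
```

Trajectories that score above the group average receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.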

Paper Structure (22 sections, 11 equations, 14 figures, 9 tables)

Introduction
Related Work
  VLA models in Autonomous Driving
  Latent Chain-of-Thought
Method
  Preliminaries
  Latent Spatio-Temporal CoT
  Progressive Two-Stage SFT Strategy
  Latent-Grounded Trajectory Refinement via GRPO
Experiment
  Implementation details
  Performance Comparison
  Ablation Studies
  Qualitative Results
Conclusion
...and 7 more sections

Figures (14)

Figure 1: Comparison of VLA Paradigms. (a) Direct VLA is efficient but lacks reasoning. (b) Explicit Textual CoT is interpretable but suffers from high latency and hallucinations. (c) Naive Latent CoT (without supervision) is efficient but unstable (model collapse). (d) Our Spatio-Temporal Latent CoT (with supervision) aligns latent features with physical priors, achieving efficiency, stability, and grounding.
Figure 2: Overview of the LaST-VLA framework. (a) Model Architecture: The model constructs a Latent CoT by aligning hidden states with dynamic and geometric priors distilled from foundation models (Cosmos and VGGT) via specialized adapters. (b) Progressive Training Strategy: The pipeline features a two-stage SFT phase that uses structured causal masking to enforce physically grounded reasoning, followed by RL fine-tuning to directly optimize the policy for driving safety and compliance.
Figure 3: Architecture of the Dynamics (a) and Geometry (b) Adapters. The random mask is used only during training.
Figure 4: PDMS performance varying with training steps during the RL phase.
Figure 5: Qualitative visualization comparing the Textual CoT baseline (Red) and LaST-VLA (Green). (a) Drivable Area Compliance (DAC): Our method maintains precise lane adherence, whereas the baseline violates spatial boundaries. (b) Time-to-Collision (TTC): Our method accurately anticipates dynamics to avoid rear-end collisions, while the baseline fails to brake effectively.
...and 9 more figures
