HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

Yiru Wang; Zichong Gu; Yu Gao; Anqing Jiang; Zhigang Sun; Shuo Wang; Yuwen Heng; Hao Sun

TL;DR

HiST-VLA is a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. It enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting, and it fuses redundant tokens rather than filtering them.

Abstract

Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their use in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner uses dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance, with an EPDMS of 88.6 on Navtest and an EPDMS of 50.9 on the pseudo closed-loop Navhard benchmark.
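
To make the distinction between fusing and filtering redundant tokens concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each visual token is a plain list of floats with a precomputed importance score; the function names (`cosine`, `fuse_tokens`), the top-k selection rule, and the averaging merge are all hypothetical choices made for illustration. A filtering scheme would simply discard the low-scoring tokens; here each one is instead averaged into its most similar kept token, so its information is retained in compressed form.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_tokens(tokens, scores, keep):
    """Reduce `tokens` to `keep` tokens by fusing rather than filtering.

    Assumption (illustrative only): the `keep` highest-scoring tokens act
    as anchors, and every other token is averaged into the anchor it is
    most similar to.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])      # preserve original token order
    redundant_idx = order[keep:]

    sums = {i: list(tokens[i]) for i in kept_idx}
    counts = {i: 1 for i in kept_idx}

    for j in redundant_idx:
        # Pick the anchor closest to the redundant token (original vectors).
        target = max(kept_idx, key=lambda i: cosine(tokens[i], tokens[j]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[j])]
        counts[target] += 1

    # Each output token is the mean of its anchor and the tokens fused into it.
    return [[s / counts[i] for s in sums[i]] for i in kept_idx]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.2, 0.8, 0.1]
fused = fuse_tokens(tokens, scores, keep=2)
print(fused)
```

In this toy input, tokens 1 and 3 are near-duplicates of tokens 0 and 2, so the fused output has two tokens that each sit between an anchor and its near-duplicate, whereas plain filtering would return the two anchors unchanged.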

Paper Structure (25 sections, 1 equation, 3 figures, 5 tables)

Introduction
Related Work
E2E Autonomous Driving
VLM for E2E Autonomous Driving
VLA in E2E Autonomous Driving
Method
HiST-VLA Framework
Spatio-Temporal VLA Model
Spatially-Aware Visual Encoding
Dynamic Token Sparser
Temporal State Modeling
Granular Meta-Action and Confidence-Aware Trajectory Reasoning Enhanced CoT
Transformer-based Hierarchical Planner
Training Strategy
Experiments
...and 10 more sections

Figures (3)

Figure 1: The Framework of the proposed HiST-VLA, including the Spatio-Temporal VLA architecture, hierarchical planner and main training stages.
Figure 2: Spatio-Temporal VLA Model: Leveraging multi-view images, long-term navigation information, and ego state, our model employs CoT reasoning and dynamic token sparsification to produce a granular driving command and a coarse trajectory with confidence for the next four seconds.
Figure 3: Trajectory visualization on Navhard across: (a) stage 1 real-world and (b) stage 2 synthetic scenes. Driving commands are shown above each scenario. Rows depict, from top to bottom: VLA coarse trajectory, semantics-aligned trajectory, and HiST-VLA trajectory.
