HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving

Yiru Wang, Zichong Gu, Yu Gao, Anqing Jiang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun

TL;DR

HiST-VLA is a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. It enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state-history prompting, and it fuses redundant tokens rather than filtering them.

Abstract

Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. However, their use in safety-critical scenarios is constrained by inherent limitations, including imprecise numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. To address these challenges, we propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation. Our framework enhances 3D spatial and temporal reasoning by integrating geometric awareness with fine-grained driving commands and state-history prompting. To ensure computational efficiency, we integrate dynamic token sparsification into the VLA architecture. This approach fuses redundant tokens rather than filtering them, effectively reducing redundancy without sacrificing model performance. Furthermore, we employ a hierarchical transformer-based planner to progressively refine coarse VLA waypoints into fine-grained trajectories. Crucially, the planner utilizes dynamic latent regularization to incorporate language commands, ensuring strict spatial grounding and temporal coherence. Extensive evaluation on the NAVSIM v2 benchmark demonstrates state-of-the-art performance, achieving an EPDMS of 88.6 on Navtest and an EPDMS of 50.9 on the pseudo closed-loop Navhard benchmark.
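
The contrast between fusing and filtering redundant tokens can be illustrated with a small NumPy sketch. This is not the paper's sparsification module, whose details the abstract does not give. It assumes a per-token importance score is already available, and it uses cosine similarity and score-weighted averaging as stand-in choices. The function names `prune_tokens` and `fuse_tokens` are hypothetical.

```python
import numpy as np


def _top_k_indices(scores, keep):
    # Indices of the `keep` highest-scoring tokens, returned in original order.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:keep])


def prune_tokens(tokens, scores, keep):
    """Filtering baseline: keep the top-`keep` tokens and discard the rest."""
    return tokens[_top_k_indices(scores, keep)]


def fuse_tokens(tokens, scores, keep):
    """Fusion sketch (an assumption, not the paper's method): merge each
    low-scoring token into its most similar kept token by a score-weighted
    average, so that its information is retained rather than dropped."""
    tokens = np.asarray(tokens, dtype=float)
    weights = np.asarray(scores, dtype=float) + 1e-8  # avoid division by zero

    kept_idx = _top_k_indices(weights, keep)
    drop_mask = np.ones(len(tokens), dtype=bool)
    drop_mask[kept_idx] = False
    drop_idx = np.flatnonzero(drop_mask)

    kept, dropped = tokens[kept_idx], tokens[drop_idx]

    def unit(x):
        # Row-wise L2 normalization for cosine similarity.
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    # For every dropped token, find the kept token it is most similar to.
    similarity = unit(dropped) @ unit(kept).T  # shape: (n_dropped, keep)
    target = similarity.argmax(axis=1)

    # Accumulate weighted sums per kept token, then normalize.
    fused_sum = kept * weights[kept_idx, None]
    weight_sum = weights[kept_idx].copy()
    np.add.at(fused_sum, target, dropped * weights[drop_idx, None])
    np.add.at(weight_sum, target, weights[drop_idx])
    return fused_sum / weight_sum[:, None]
```

Both functions return `keep` tokens, so the downstream sequence length is the same. The difference is that `fuse_tokens` lets the discarded tokens shift the kept ones toward their weighted mean, whereas `prune_tokens` loses them entirely.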

Paper Structure

This paper contains 25 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The Framework of the proposed HiST-VLA, including the Spatio-Temporal VLA architecture, hierarchical planner and main training stages.
  • Figure 2: Spatio-Temporal VLA Model: Leveraging multi-view images, long-term navigation information, and ego state, our model employs CoT reasoning and dynamic token sparsification to produce a granular driving command and a coarse trajectory with confidence for the next four seconds.
  • Figure 3: Trajectory visualization on Navhard across: (a) stage 1 real-world and (b) stage 2 synthetic scenes. Driving commands are shown above each scenario. Rows depict, from top to bottom: VLA coarse trajectory, semantics-aligned trajectory, and HiST-VLA trajectory.