DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Xianpeng Lang

TL;DR

DriveAction tackles the misalignment between existing autonomous driving benchmarks and real human decisions by introducing an action-driven VLA benchmark built from driver-contributed scenarios. It couples high-level, real-time action labels with a tree-structured evaluation that directly links vision, language, and action tasks, enabling both comprehensive and task-specific assessments. The study analyzes twelve VLMs and two driving-domain models, showing that both visual and linguistic cues improve action predictions and that MOE-based on-vehicle models achieve competitive performance relative to generalist models. This benchmark provides a rigorous, scalable foundation for diagnosing bottlenecks and guiding development toward more human-like driving decisions.

Abstract

Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by drivers of autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from drivers' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.

Paper Structure

This paper contains 26 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Action-Rooted Tree-Structured Task Architecture in DriveAction
  • Figure 2: Distribution of QA Pairs Across Tasks in DriveAction
  • Figure 3: Example of the V-L-A Pipeline in Traffic Sign Task
  • Figure 4: Effect of Navigation Information Design on Model Decision Evaluation
  • Figure 5: Model Performance (%) on Action Across Task Categories