Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, Hao Zhao

TL;DR

The paper introduces Impromptu VLA, an open dataset of about 80k unstructured-road driving clips, distilled from over 2M source clips across 8 open-source datasets, for training Vision-Language-Action (VLA) models. It defines a four-category taxonomy of unstructured scenarios and a planning-oriented multi-task QA annotation framework, with annotations checked through human verification. Experiments show that pretraining on Impromptu VLA improves closed-loop NeuroNCAP scores and collision rates as well as open-loop nuScenes trajectory accuracy, while the QA suite serves as a diagnostic for gains in perception, prediction, and planning. By releasing data, code, and models, the work aims to advance robust VLA-based driving in real-world, unstructured environments.

Abstract

Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: they improve closed-loop NeuroNCAP scores and collision rates, and reach near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.

Paper Structure

This paper contains 16 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Visual Abstract of Impromptu VLA. We construct the Impromptu VLA Dataset, which contains over 80K clips curated from 8 open-source datasets, focusing on four critical types of unstructured "corner case" scenarios that challenge current autonomous driving vehicles. It supports interconnected VLA tasks including scene understanding, prediction, meta planning and trajectory planning. Key experimental results demonstrate that VLA models trained with the Impromptu VLA Dataset achieve significant performance improvements in both closed-loop and open-loop metrics.
  • Figure 2: Comparison of the characteristics of different driving scene datasets. Panel (a) shows the distribution of scene categories across the datasets and the number of video clips in each, giving a direct view of each dataset's emphasis on different scene types and its scale. Panel (b) compares the trajectory distribution in the original datasets with that in our constructed dataset, illustrating the trajectory diversity of our dataset. Panel (c) shows examples of the different scene categories from the 8 source datasets. Notably, the IDD dataset lacks data for the "Challenging Road Conditions" category.
  • Figure 3: Data Processing and Annotation Pipeline for the Impromptu VLA Dataset. The diagram outlines the sequential process for creating our dataset, starting from raw data collection and scenario taxonomy definition (Sec. \ref{sec:Taxonomy}), through frequency alignment and keyclip selection, to multi-task annotation generation via Qwen2.5-VL (including scene description, object/feature analysis, and labeling), and concluding with rigorous human verification (Sec. \ref{sec:task}).
  • Figure 4: Open-loop trajectory prediction L2 errors (m) on the nuScenes dataset. Marker 1 denotes results sourced from qiao2025lightemma, 2 from xing2025openemma, and 3 from hwang2024emma. Best results within each category are in bold; second-best results are underlined.
  • Figure 5: NeuroNCAP performance in challenging scenarios. This figure compares the driving behavior of two models in three representative challenging scenarios: static, frontal, and side. For each scenario, the left column shows the behavior of the base model, which is fine-tuned on nuScenes. The right column shows the behavior of the model trained on a subset of our proposed dataset and then fine-tuned on nuScenes. Compared to the base model, the model trained with our data avoids other vehicles more reliably, for example by turning or slowing down.
  • ...and 11 more figures
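
The open-loop results in Figure 4 are reported as L2 errors in metres. As a rough illustration of what such a number measures, the sketch below computes the Euclidean distance between predicted and ground-truth ego waypoints at fixed horizons. The function name `l2_errors_at_horizons`, the 2 Hz waypoint rate, and the 1 s / 2 s / 3 s horizons are assumptions made for this example; the listing above does not state the exact evaluation protocol, and some protocols instead average the error over all steps up to each horizon.

```python
import numpy as np


def l2_errors_at_horizons(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """Return {horizon_seconds: L2 error in metres} for one trajectory.

    pred, gt: array-likes of shape (T, 2) holding future (x, y) waypoints
    in metres, sampled at `hz` waypoints per second (assumed, not from
    the source).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.ndim != 2 or pred.shape[1] != 2:
        raise ValueError("pred and gt must both have shape (T, 2)")

    # Euclidean distance between prediction and ground truth at each step.
    per_step = np.linalg.norm(pred - gt, axis=1)

    errors = {}
    for h in horizons_s:
        idx = int(h * hz) - 1  # index of the waypoint at h seconds
        if idx >= len(per_step):
            raise ValueError(f"trajectory too short for a {h} s horizon")
        errors[h] = float(per_step[idx])
    return errors


# Toy data: 6 waypoints (3 s at 2 Hz) driving straight ahead, with a
# prediction offset by (0.3, 0.4) m at every step.
gt_traj = np.array([[0.0, float(i)] for i in range(1, 7)])
pred_traj = gt_traj + np.array([0.3, 0.4])
errors = l2_errors_at_horizons(pred_traj, gt_traj)
```

With a constant (0.3, 0.4) m offset, the error at every horizon is 0.5 m up to floating-point rounding. A dataset-level score would average these per-sample values over all evaluation samples.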