CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto

TL;DR

CoVLA addresses the shortage of large-scale datasets that jointly cover vision, language, and action for autonomous driving. It introduces the CoVLA-Dataset (10,000 real-world scenes, over 80 hours) with automated trajectory labeling and captioning, and the CoVLA-Agent, a VLA model that predicts future trajectories while generating scene descriptions. Experiments show that using ground-truth captions yields more accurate trajectory predictions than predicted captions, demonstrating strong language-action alignment and interpretability. The dataset and CoVLA-Agent establish a scalable, data-driven framework for interpretable end-to-end autonomous driving with rich multimodal supervision.

Abstract

Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose the CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purposes.

Paper Structure

This paper contains 33 sections, 2 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: CoVLA framework overview. We develop CoVLA-Dataset, a comprehensive dataset for autonomous driving encompassing 10,000 unique video clips, frame-level language captions describing the driving scenarios, and future trajectory actions. We also show CoVLA-Agent, a VLM-based path planning model capable of predicting the future trajectory of the vehicle and providing a textual description of its behavior and reasoning.
  • Figure 2: Overview of the dataset generation pipeline. We automatically label video frames and sensor signals to generate trajectories and other labels. Furthermore, we apply auto-captioning to the video frames to produce both behavior and reasoning captions.
  • Figure 3: Frame examples from CoVLA-Dataset. Estimated trajectories (green line) and captions generated by the captioner model are shown. The key objects are highlighted in blue bold text, while the failures in captions are shown in red bold text.
  • Figure 4: Data distribution of vehicle speed and steering angle. The red bars represent the distribution before sampling, while the yellow bars show the distribution after sampling. Note that a logarithmic scale is used for clarity in (b).
  • Figure 5: The architecture for CoVLA-Agent.
  • ...and 2 more figures
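
The dataset composition described in the Figure 1 and Figure 2 captions (video frames paired with behavior and reasoning captions and a future trajectory) can be pictured as a per-frame record. The sketch below is illustrative only: the class name `FrameSample`, its field names, the 2D waypoint format, and the `mean_displacement` helper are assumptions made for this example, not the released dataset schema or the evaluation metric used in the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical waypoint format: (x, y) in metres in the vehicle frame.
Waypoint = Tuple[float, float]


@dataclass
class FrameSample:
    """One illustrative frame-level record (field names are assumptions)."""
    clip_id: str                       # which video clip the frame belongs to
    frame_index: int                   # position of the frame within the clip
    behavior_caption: str              # what the vehicle is doing
    reasoning_caption: str             # why it is doing it
    future_trajectory: List[Waypoint]  # future waypoints of the ego vehicle


def mean_displacement(pred: List[Waypoint], ref: List[Waypoint]) -> float:
    """Mean Euclidean distance between matching waypoints.

    A generic way to compare a predicted trajectory with a reference one;
    not necessarily the measure reported in the paper.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.hypot(px - rx, py - ry)
               for (px, py), (rx, ry) in zip(pred, ref)) / len(pred)


sample = FrameSample(
    clip_id="clip_0001",
    frame_index=0,
    behavior_caption="The ego vehicle is moving straight at a moderate speed.",
    reasoning_caption="The road ahead is clear.",
    future_trajectory=[(0.0, 0.0), (3.0, 4.0)],
)
error = mean_displacement(sample.future_trajectory, [(0.0, 0.0), (0.0, 0.0)])
```

Keeping the captions and the trajectory in the same record reflects the idea that language and action labels describe the same frame, so a model can be supervised on both at once.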