DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model

Yuqi Wang, Ke Cheng, Jiawei He, Qitai Wang, Hengchen Dai, Yuntao Chen, Fei Xia, Zhaoxiang Zhang

TL;DR

The paper introduces DrivingDojo, a large-scale driving video dataset built for training interactive world models that handle a complete set of ego maneuvers, multi-agent interactions, and open-world knowledge. It defines an action instruction following (AIF) benchmark to evaluate action-conditioned future predictions and shows that models trained on DrivingDojo achieve higher visual fidelity and stronger action following, including zero-shot transfer to new datasets. The dataset comprises three subsets (DrivingDojo-Action, DrivingDojo-Interplay, and DrivingDojo-Open) collected from Meituan's fleet, totaling about 18k videos and 7,500 hours, with careful curation and privacy safeguards. The work also discusses limitations such as hallucinations and short-horizon predictions, outlines future directions toward longer-horizon world modeling and policy evaluation, and addresses societal impact and licensing.

Abstract

Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.

Paper Structure

This paper contains 62 sections, 3 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Examples on DrivingDojo. (a) showcases various driving actions, such as lane changes, abrupt braking at traffic control, and turning at intersections. (b) illustrates the ego-car's interactions with other dynamic agents, including cutting-in and cutting-off maneuvers. (c) displays encounters with rolling or falling objects, moving or floating unknown objects, and interactions with traffic lights and boom barriers. (d) presents diverse cases encountered in real-world driving scenarios.
  • Figure 2: Enhancing interactive and knowledge-enriched learning of world models. Data plays a crucial role in modeling the world. DrivingDojo is a large-scale video dataset curated from millions of daily collected videos, designed to investigate real-world visual interactions. DrivingDojo features comprehensive actions, multi-agent interplay, and rich open-world driving knowledge, serving as a superb platform for studying driving world models.
  • Figure 3: The strengths of the DrivingDojo dataset. (a) illustrates a comparison of action distributions among nuScenes, ONCE, and our DrivingDojo. We compare the average hourly event counts of driving actions. (b) presents the distribution of text descriptions for the video clips in DrivingDojo.
  • Figure 4: Descriptive statistics of the DrivingDojo dataset. The dataset was collected from various regions across China, including nighttime and rainy/snowy conditions.
  • Figure 5: Predicting multiple futures based on different actions. Left: going straight, turning left, and turning right at a crossing; Right: changing to the left lane, staying in the current lane, and changing to the right lane.
  • ...and 11 more figures