WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving

Yiheng Li; Cunxin Fan; Chongjian Ge; Zhihao Zhao; Chenran Li; Chenfeng Xu; Huaxiu Yao; Masayoshi Tomizuka; Bolei Zhou; Chen Tang; Mingyu Ding; Wei Zhan

TL;DR

WOMD-Reasoning is a large-scale, multi-modal dataset focused on traffic rule-induced and human intention-induced interactions in driving, built automatically on top of WOMD using rule-based translations and GPT-4 prompts. The paper demonstrates Motion-LLaVA, a motion-language model fine-tuned on the dataset, which performs well in interaction prediction, traffic rule-compliant planning, and driving-related Q&A. The work also adds a simulation-based vision extension and shows that language-augmented trajectory prediction improves downstream tasks. Together, the dataset and methods enable richer interaction reasoning and safer planning for autonomous driving systems.

Abstract

Language models show unprecedented abilities in analyzing driving scenarios, owing to the vast knowledge they accumulate from text-based pre-training. Naturally, they should particularly excel at analyzing rule-based interactions, such as those triggered by traffic laws, which are well documented in text. However, such interaction analysis remains underexplored due to the lack of dedicated language datasets that address it. Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&A dataset built on WOMD that focuses on describing and reasoning about traffic rule-induced interactions in driving scenarios. WOMD-Reasoning is also by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios, covering a wide range of driving topics from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. To showcase the applications of WOMD-Reasoning, we design Motion-LLaVA, a motion-language model fine-tuned on WOMD-Reasoning. Quantitative and qualitative evaluations are performed on the WOMD-Reasoning dataset as well as on the outputs of Motion-LLaVA, supporting the data quality of WOMD-Reasoning and its broad applicability to tasks such as interaction prediction and traffic rule-compliant planning. The dataset and its vision-modality extension are available at https://waymo.com/open/download/. The code and prompts used to build it are available at https://github.com/yhli123/WOMD-Reasoning.

Paper Structure (35 sections, 8 figures, 17 tables)


Introduction
Related Work
Method
Building WOMD-Reasoning Dataset
Fine-tuning Multi-modal Model on WOMD-Reasoning
WOMD-Reasoning Dataset Specifications
Dataset Statistics
Interaction Contents
Dataset Quality Evaluation by Human
Vision Extension with Simulations
Dataset Application and Evaluation with Motion-LLaVA
Interaction Prediction
Traffic Rule Compliant Planning
Answering Various Driving-related Questions
Validation of Motion-LLaVA
...and 20 more sections

Figures (8)

Figure 1: Examples of traffic rule-induced interactions in the WOMD-Reasoning dataset. (a) captures the traffic rule-induced interaction between the ego agent and agent #0, attributing it correctly to the stop signs. (b) shows the traffic light-controlled yielding interaction between the ego agent and agent #1. The front-view visualization is created with the MetaDrive simulator (Li et al., 2022).
Figure 2: Selected vocabulary statistics in WOMD-Reasoning. We count vocabulary terms strongly related to traffic rule-induced and human intention-induced interactions in WOMD-Reasoning, illustrating that the dataset contains abundant descriptions of and reasoning about such interactions.
Figure 3: A demonstration of Q&As in each part of the WOMD-Reasoning dataset. We show Q&As in all categories for one scenario, demonstrating the language analysis of overtaking, a human intention-induced interaction in the WOMD-Reasoning dataset.
Figure 4: The Motion-LLaVA pipeline for fine-tuning a multi-modal model with WOMD-Reasoning. Motion data pass through pre-trained motion vector encoders from MultiPath++ (Varadarajan et al., 2021) and a projector layer, and then serve as inputs together with the questions in WOMD-Reasoning. The answers in WOMD-Reasoning serve as the supervision.
Figure 5: Interaction predictions made by Motion-LLaVA on various WOMD scenarios.
...and 3 more figures
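
The input construction described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the single linear projector, the toy dimensions, the random initialization, the choice to place motion tokens before the question tokens, and all function names (`make_projector`, `project`, `build_input_sequence`) are assumptions made only for illustration. The pre-trained motion encoder is replaced by hand-written feature vectors.

```python
import random


def make_projector(in_dim, out_dim, seed=0):
    """Create a hypothetical linear projector (weight matrix and bias)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, projector):
    """Map each encoded motion feature into the language-model embedding space."""
    weight, bias = projector
    return [
        [sum(w * x for w, x in zip(row, feat)) + b
         for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_input_sequence(motion_features, question_embeddings, projector):
    """Concatenate projected motion tokens with question token embeddings.

    Assumption: motion tokens come first; the actual ordering is not
    specified in the text above.
    """
    motion_tokens = project(motion_features, projector)
    return motion_tokens + question_embeddings


# Toy stand-ins: 2 encoded motion elements of dimension 3,
# and 5 question tokens already embedded in dimension 4.
motion_features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
question_embeddings = [[0.0] * 4 for _ in range(5)]

projector = make_projector(in_dim=3, out_dim=4)
sequence = build_input_sequence(motion_features, question_embeddings, projector)
print(len(sequence), len(sequence[0]))  # 7 4
```

In fine-tuning as the caption describes it, the answers in WOMD-Reasoning would supervise the model's output for such a sequence; that training step is omitted here.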
