WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving

Yiheng Li, Cunxin Fan, Chongjian Ge, Zhihao Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, Mingyu Ding, Wei Zhan

TL;DR

WOMD-Reasoning is a large-scale, multi-modal dataset focused on traffic-rule-induced and human-intention-induced interactions in driving, built automatically on top of WOMD using rule-based translation and GPT-4 prompts. The authors also present Motion-LLaVA, a motion-language model fine-tuned on the dataset, which performs strongly in interaction prediction, traffic-rule-compliant planning, and driving-related Q&A. The work additionally provides a vision-modality extension generated through simulation and shows that language-augmented trajectory prediction improves downstream tasks. Together, the dataset and methods enable richer interaction reasoning and safety-focused planning for autonomous driving systems.

Abstract

Language models show unprecedented abilities in analyzing driving scenarios, owing to the extensive knowledge they accumulate from text-based pre-training. Naturally, they should particularly excel at analyzing rule-based interactions, such as those triggered by traffic laws, which are well documented in text. However, such interaction analysis remains underexplored because no dedicated language dataset addresses it. Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&A dataset built on WOMD that focuses on describing and reasoning about traffic-rule-induced interactions in driving scenarios. WOMD-Reasoning is also by far the largest multi-modal Q&A dataset on real-world driving scenarios, with 3 million Q&As covering a wide range of driving topics, from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. To showcase the applications of WOMD-Reasoning, we design Motion-LLaVA, a motion-language model fine-tuned on WOMD-Reasoning. Quantitative and qualitative evaluations are performed on the WOMD-Reasoning dataset as well as on the outputs of Motion-LLaVA, supporting the data quality and wide applicability of WOMD-Reasoning in interaction prediction, traffic-rule-compliant planning, and related tasks. The dataset and its vision-modality extension are available at https://waymo.com/open/download/. The code and prompts used to build it are available at https://github.com/yhli123/WOMD-Reasoning.

Paper Structure

This paper contains 35 sections, 8 figures, and 17 tables.

Figures (8)

  • Figure 1: Examples of traffic-rule-induced interactions in the WOMD-Reasoning dataset. (a) captures the traffic-rule-induced interaction between the ego agent and agent #0, correctly attributing it to the stop signs. (b) shows the traffic-light-controlled yielding interaction between the ego agent and agent #1. The front-view visualization is created with the MetaDrive simulator [li2022metadrive].
  • Figure 2: Selected vocabulary statistics in WOMD-Reasoning. We count the vocabulary strongly related to traffic-rule-induced and human-intention-induced interactions in WOMD-Reasoning, illustrating that the dataset contains abundant descriptions of, and reasoning about, such interactions.
  • Figure 3: A demonstration of Q&As in each part of the WOMD-Reasoning dataset. We show Q&As from all categories for one scenario, demonstrating language analysis of overtaking, a human-intention-induced interaction in the WOMD-Reasoning dataset.
  • Figure 4: The Motion-LLaVA pipeline, which fine-tunes a multi-modal model with WOMD-Reasoning. Motion data pass through pre-trained motion vector encoders from Multipath++ [varadarajan2021multipath] and a projector layer, and are then combined with the questions in WOMD-Reasoning to form the inputs. The answers in WOMD-Reasoning serve as the supervision.
  • Figure 5: Interaction predictions made by Motion-LLaVA on various WOMD scenarios.
  • ...and 3 more figures
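
The input construction described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the projector is a single linear layer, replaces the pre-trained motion encoder and the question embeddings with random stand-in vectors, and uses hypothetical names and sizes (`D_MOTION`, `D_LANG`, `project`, `build_input_sequence`). It only shows how projected motion features and question tokens could end up in one sequence of same-sized embeddings.

```python
import random

D_MOTION = 4  # hypothetical size of an encoded motion feature
D_LANG = 3    # hypothetical size of a language-model token embedding


def project(features, weight, bias):
    """Apply one linear layer: each feature vector becomes a len(bias)-sized vector."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_input_sequence(motion_features, question_embeddings, weight, bias):
    """Place the projected motion tokens in front of the question token embeddings."""
    return project(motion_features, weight, bias) + question_embeddings


rng = random.Random(0)
# Stand-ins for the outputs of a pre-trained motion encoder (2 scene elements).
motion_features = [[rng.random() for _ in range(D_MOTION)] for _ in range(2)]
# Stand-ins for the embedded tokens of one question (5 tokens).
question_embeddings = [[rng.random() for _ in range(D_LANG)] for _ in range(5)]
# Projector parameters; in fine-tuning these would be learned, here they are random.
weight = [[rng.random() for _ in range(D_MOTION)] for _ in range(D_LANG)]
bias = [0.0] * D_LANG

sequence = build_input_sequence(motion_features, question_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # prints "7 3"
```

In this sketch the answers would act only as the training target for the language model's output, matching the caption's statement that they serve as the supervision. Placing the motion tokens before the question is an assumed ordering.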