WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, Hyunjung Shim

TL;DR

This work introduces WaymoQA, a training-enabled, multi-view driving QA dataset designed to support safety-critical reasoning in autonomous driving. It defines a two-stage Safety-Critical Reasoning task and provides both image and video QA formats to capture immediate risks and downstream consequences. Empirical results show that current multimodal language models underperform on safety-critical scenes, but targeted fine-tuning on WaymoQA yields substantial gains and narrows the gap with normal driving reasoning, while highlighting remaining challenges such as temporal grounding and coordinate-frame understanding. The dataset and findings offer a concrete path toward safer, more capable driving agents by enabling end-to-end reasoning across perception, prediction, and planning with diverse supervision signals.

Abstract

Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.
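
The comparison between safety-critical and normal scenes described above comes down to a per-group accuracy and the gap between the two groups. The following is a minimal sketch of that bookkeeping for multiple-choice answers. The `MCQResult` schema, the `scene_type` labels, and the toy results are all assumptions made for illustration; they are not the WaymoQA file format or numbers from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQResult:
    """One graded multiple-choice answer (hypothetical schema, not the WaymoQA format)."""
    scene_type: str   # assumed labels: "safety_critical" or "normal"
    modality: str     # "image" or "video"
    predicted: str    # option letter chosen by the model
    correct: str      # ground-truth option letter


def accuracy_by_scene_type(results):
    """Return {scene_type: accuracy} over the graded answers."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r.scene_type] += 1
        hits[r.scene_type] += int(r.predicted == r.correct)
    return {scene: hits[scene] / totals[scene] for scene in totals}


# Toy, made-up results purely for illustration (not results from the paper).
toy_results = [
    MCQResult("normal", "image", "A", "A"),
    MCQResult("normal", "video", "B", "B"),
    MCQResult("normal", "image", "C", "D"),
    MCQResult("normal", "video", "D", "D"),
    MCQResult("safety_critical", "image", "A", "B"),
    MCQResult("safety_critical", "video", "C", "C"),
    MCQResult("safety_critical", "image", "B", "D"),
    MCQResult("safety_critical", "video", "A", "A"),
]

acc = accuracy_by_scene_type(toy_results)
# A positive gap means the model does worse on safety-critical scenes.
gap = acc["normal"] - acc["safety_critical"]
```

Running the same function on results from before and after fine-tuning would show whether the gap narrows, which is the effect the abstract reports for models fine-tuned on WaymoQA.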

Paper Structure

This paper contains 26 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: WaymoQA Overview. Multi-view scene and two-stage reasoning (Perception $\rightarrow$ Prediction $\rightarrow$ Planning) under compounding risks. The first plan detours around a parked motorcycle; the second plan returns to avoid an oncoming car. Counterfactual questions ask about alternative actions, and Safety-Critical Relationship questions capture spatial relations among agents in safety-critical scenes.
  • Figure 2: Comparison across Different Views of the Same Scene.
  • Figure 3: Overview. A three-step process: (1) filter Waymo End-to-End sequences using U.S. NHTSA pre-crash scenario types [national2007pre] and select balanced safety-critical key frames; (2) construct a structured QA bank covering core reasoning skills; (3) complete human answering and verification to produce Video/Image VQA and MCQ splits.
  • Figure 4: VQA distribution in the WaymoQA dataset.
  • Figure 5: Examples of Video/Image QA in WaymoQA. The video QA focuses on the realized temporal sequence, while the image QA, given a single key frame, supports broader, decision-oriented questions about feasible actions and near-term outcomes. Pairing image-based and video-based QA increases the diversity of reasoning signals for rare, long-tail safety-critical scenarios.
  • ...and 1 more figure