NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving

Yuan Gao, Mattia Piccinini, Roberto Brusnicki, Yuchen Zhang, Johannes Betz

TL;DR

NuRisk addresses the need for quantitative, agent-level spatio-temporal risk reasoning in autonomous driving by providing a BEV-sequence Visual Question Answering dataset sourced from nuScenes, Waymo, and CommonRoad. The authors benchmark pre-trained Vision-Language Models, reveal substantial gaps in spatio-temporal risk reasoning, and show that physics-enhanced inputs significantly improve performance. A parameter-efficient LoRA-based fine-tuning of a 7B VLM agent (Qwen2.5-VL-7B-Instruct) yields 41.1% accuracy, MAE 1.01, and 10.2 s latency, outperforming proprietary baselines and demonstrating explicit causal spatio-temporal reasoning. NuRisk thus establishes a critical benchmark and a practical training pathway toward robust, real-time risk reasoning for autonomous driving systems.

Abstract

Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatio-temporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2,900 scenarios and 1.1 million agent-level samples, built on real-world data from nuScenes and Waymo, supplemented with safety-critical scenarios from the CommonRoad simulator. The dataset provides Bird's-Eye-View (BEV)-based sequential images with quantitative, agent-level risk annotations, enabling spatio-temporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatio-temporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatio-temporal reasoning capabilities that proprietary models lack. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving.
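The accuracy and MAE figures quoted above are the kind of metrics one would compute over discrete, agent-level risk levels. The following is a minimal sketch of that computation, assuming integer risk levels on the 0 to 5 scale and exact-match accuracy; the function name `risk_metrics` and the sample values are hypothetical and are not taken from the paper, whose evaluation protocol may differ.

```python
from typing import Sequence


def risk_metrics(predicted: Sequence[int], reference: Sequence[int]) -> tuple[float, float]:
    """Return (exact-match accuracy, mean absolute error) over paired risk levels.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    # Accuracy: fraction of agents whose predicted level matches the reference exactly.
    exact = sum(p == r for p, r in zip(predicted, reference))
    # MAE: average distance between predicted and reference levels.
    abs_err = sum(abs(p - r) for p, r in zip(predicted, reference))
    return exact / n, abs_err / n


# Made-up risk levels for five agents (assumed 0-5 scale).
reference = [0, 3, 5, 2, 4]
predicted = [0, 2, 5, 5, 4]
accuracy, mae = risk_metrics(predicted, reference)
print(f"accuracy={accuracy:.2f}, MAE={mae:.2f}")
```

Reporting both numbers is useful because accuracy counts only exact matches, while MAE also reflects how far off the misses are on the ordinal scale.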

Paper Structure

This paper contains 28 sections, 3 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of NuRisk: Existing VLM-based risk assessment is typically limited to (i) image-plane grounding and (ii) qualitative assessment. NuRisk introduces an agent-level quantitative dataset that enables (iii) spatial–temporal quantitative assessment for risk reasoning.
  • Figure 2: Framework of NuRisk. Multi-modal inputs are processed into BEV scenes and risk metrics to enable conversation-based VQA with chain-of-thought reasoning, supporting risk evaluation, benchmarking, fine-tuning, and safety-critical scenario analysis.
  • Figure 3: Dataset statistics and risk distribution of NuRisk. (a) Sunburst diagram showing scenario composition across data sources (nuScenes, Waymo, CommonRoad), categorized by minimum agent risk levels and agent types. (b) Histogram of agent-level risk score distribution (0–5) in the final VQA dataset, with source-wise breakdown and overall percentages.
  • Figure 4: NuRisk VLM Agent Fine-tuning Architecture.
  • Figure 5: Performance comparison between the best proprietary model and our two fine-tuned NuRisk VLM Agent configurations. Response time and MAE are shown on inverted scales to make the comparison easier to interpret.