NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving
Yuan Gao, Mattia Piccinini, Roberto Brusnicki, Yuchen Zhang, Johannes Betz
TL;DR
NuRisk addresses the need for quantitative, agent-level spatio-temporal risk reasoning in autonomous driving by providing a BEV-sequence Visual Question Answering dataset sourced from nuScenes, Waymo, and CommonRoad. The authors benchmark pre-trained Vision-Language Models, reveal substantial gaps in spatio-temporal risk reasoning, and show that physics-enhanced inputs significantly improve performance. Parameter-efficient LoRA fine-tuning of a 7B VLM agent (Qwen2.5-VL-7B-Instruct) yields 41.1% accuracy, a mean absolute error (MAE) of 1.01, and a latency of 10.2 s, outperforming proprietary baselines and demonstrating explicit causal spatio-temporal reasoning. NuRisk thus establishes a critical benchmark and a practical training pathway toward robust, real-time risk reasoning for autonomous driving systems.
Abstract
Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatio-temporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2,900 scenarios and 1.1 million agent-level samples, built on real-world data from nuScenes and Waymo, supplemented with safety-critical scenarios from the CommonRoad simulator. The dataset provides Bird's-Eye-View (BEV) based sequential images with quantitative, agent-level risk annotations, enabling spatio-temporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatio-temporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatio-temporal reasoning capabilities that proprietary models lack. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving.
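
The headline results are reported as accuracy and MAE over agent-level risk predictions. As a rough illustration of how two such metrics can be computed, the sketch below assumes each agent is assigned an integer risk level; the scale, the sample values, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Fraction of agents whose predicted risk level matches the label exactly."""
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def mean_absolute_error(pred: Sequence[int], true: Sequence[int]) -> float:
    """Average absolute distance between predicted and labeled risk levels."""
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


# Hypothetical per-agent risk levels (assumed integer scale, for illustration only).
true_levels = [0, 2, 5, 3, 1]
pred_levels = [0, 3, 5, 1, 1]

acc = accuracy(pred_levels, true_levels)
mae = mean_absolute_error(pred_levels, true_levels)
print(f"accuracy={acc:.2f}, MAE={mae:.2f}")
```

Under this reading, accuracy rewards only exact matches, while MAE also reflects how far off a wrong prediction is, which is why the two are informative together on an ordinal risk scale.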
