RefAV: Towards Planning-Centric Scenario Mining
Cainan Davidson, Deva Ramanan, Neehar Peri
TL;DR
This work addresses the challenge of identifying and localizing safety-critical driving scenarios in uncurated autonomous-vehicle logs. It introduces RefAV, a large-scale benchmark built on Argoverse 2 with 10,000 natural language prompts that describe complex multi-agent interactions and support 3D spatio-temporal localization over 20-second logs. For scenario mining, the authors propose RefProg, a program-synthesis approach in which an LLM composes Python programs from atomic motion primitives to ground a prompt in 3D tracks. They compare it against several zero-shot baselines, including API-driven, caption-based, and embedding-based methods. RefProg achieves the best performance on HOTA-Temporal and related metrics and generalizes across datasets to nuPrompt. The authors also discuss the outcomes of their recent competition and directions for improving open-vocabulary temporal reasoning for autonomous-driving safety cases.
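To make the program-synthesis idea concrete, here is a minimal sketch of how a prompt could be grounded by composing atomic primitives over 3D tracks. Everything in it is an illustrative assumption rather than the RefAV or RefProg API: the toy track format, the primitive names (`of_category`, `moving`, `near`), their thresholds, and the example prompt are all hypothetical.

```python
from math import hypot

# Toy track format (an assumption, not the Argoverse 2 schema):
# {track_id: {"category": str, "states": {timestamp_s: (x_m, y_m)}}}

def of_category(tracks, category):
    """Hypothetical primitive: all timestamps of every track with this category."""
    return {tid: set(t["states"]) for tid, t in tracks.items()
            if t["category"] == category}

def moving(tracks, candidates, min_speed=0.5):
    """Hypothetical primitive: keep timestamps where speed exceeds min_speed (m/s)."""
    out = {}
    for tid, stamps in candidates.items():
        states = tracks[tid]["states"]
        ordered = sorted(states)
        kept = set()
        for prev, cur in zip(ordered, ordered[1:]):
            (x0, y0), (x1, y1) = states[prev], states[cur]
            speed = hypot(x1 - x0, y1 - y0) / (cur - prev)
            if cur in stamps and speed > min_speed:
                kept.add(cur)
        if kept:
            out[tid] = kept
    return out

def near(tracks, candidates, related, max_dist=5.0):
    """Hypothetical primitive: keep timestamps within max_dist (m) of any related track."""
    out = {}
    for tid, stamps in candidates.items():
        kept = set()
        for ts in stamps:
            x, y = tracks[tid]["states"][ts]
            for rid, rstamps in related.items():
                if rid != tid and ts in rstamps:
                    rx, ry = tracks[rid]["states"][ts]
                    if hypot(x - rx, y - ry) <= max_dist:
                        kept.add(ts)
                        break
        if kept:
            out[tid] = kept
    return out

def scenario(tracks):
    """A program an LLM might emit for the prompt 'moving vehicle near a pedestrian'."""
    vehicles = moving(tracks, of_category(tracks, "vehicle"))
    pedestrians = of_category(tracks, "pedestrian")
    return near(tracks, vehicles, pedestrians)

toy_tracks = {
    "car_1": {"category": "vehicle",
              "states": {0.0: (0.0, 0.0), 1.0: (4.0, 0.0), 2.0: (8.0, 0.0)}},
    "car_2": {"category": "vehicle",
              "states": {0.0: (50.0, 0.0), 1.0: (50.0, 0.0), 2.0: (50.0, 0.0)}},
    "ped_1": {"category": "pedestrian",
              "states": {0.0: (9.0, 2.0), 1.0: (9.0, 2.0), 2.0: (9.0, 2.0)}},
}

result = scenario(toy_tracks)
print(result)  # {'car_1': {2.0}}
```

Each primitive returns a mapping from track ID to the timestamps at which the condition holds, so the composed program yields both which objects match (spatial grounding) and when they match (temporal localization). In this toy log only `car_1` qualifies, and only at the final timestamp, when it is moving and within 5 m of the pedestrian.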
Abstract
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning, derived from 1,000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
