RefAV: Towards Planning-Centric Scenario Mining

Cainan Davidson, Deva Ramanan, Neehar Peri

TL;DR

This work addresses the challenge of identifying and localizing safety-critical driving scenarios in uncurated autonomous-vehicle logs. It introduces RefAV, a large-scale benchmark built on Argoverse 2 with 10,000 natural language prompts that describe complex multi-agent interactions and require 3D spatio-temporal localization within 20-second logs. For scenario mining, the authors propose RefProg, a program-synthesis approach that grounds 3D tracks by prompting an LLM to compose Python programs from atomic motion primitives. They compare it against several zero-shot baselines, including API-driven, caption-based, and embedding-based methods. RefProg achieves the best performance on HOTA-Temporal and related metrics and generalizes across datasets to nuPrompt. The authors also discuss the outcomes of their competition and future improvements in open-vocabulary temporal reasoning for autonomous driving safety cases.

Abstract

Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning, derived from 1,000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html

Paper Structure

This paper contains 18 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Scenario Mining Problem Setup. Given a natural language prompt such as "vehicle making left turn through ego-vehicle's path while it is raining", our problem setup requires models to determine whether the described scenario occurs within a 20-second driving log, and if so, precisely localize the referred object in 3D space and time from raw sensor data (LiDAR, 360$^\circ$ ring cameras, and HD maps). Based on the example above, a VLM should localize the start and end timestamps and 3D location of the red Mini Cooper executing a "Pittsburgh left" through the ego-vehicle's path with a 3D track. Notably, the "Pittsburgh left" is a regional driving practice where a driver quickly makes a left turn before oncoming traffic proceeds. Although common in Pittsburgh, this maneuver is technically illegal. Therefore, we argue that scenario mining is critical for validating end-to-end autonomy in order to build a comprehensive safety case. Note that referred objects are shown in green, related objects in blue, and other objects in red.
  • Figure 2: RefAV Dataset Creation. First, we define a set of 28 atomic functions that identify the state of an object track, its relationship with other objects (stored in an underlying scene graph), and a set of boolean logical operators to support function composition. Next, we prompt an LLM to permute these atomic functions and generate a program and corresponding natural language description. Finally, we execute the generated code on ground-truth tracks and visualize the referred object track to manually verify that the program output matches the natural language prompt. Code that generates an incorrect video is modified by an annotator and re-executed. We sample valid programs to maximize scenario diversity in our dataset.
  • Figure 3: Examples of Multi-Agent Interactions. We visualize representative examples from RefAV to highlight the diversity of our dataset. In (a), we capture the interactions between vulnerable road users and vehicles at a crowded intersection. Scenario (b) presents an atypical instance of a common multi-agent interaction (e.g. pedestrian walking a dog). In (c), we show a complex ego-vehicle trajectory that involves multiple moving vehicles. Scenario (d) illustrates an example of a rare multi-object interaction. In (e), we highlight a scenario that might require evasive maneuvers from the ego-vehicle (e.g. the occluded pedestrian might cross the path of the ego-vehicle). Finally, subfigure (f) visualizes a scenario with a multiple-step relationship (e.g. the official signaler is standing inside of a construction zone). Note that we show referred objects in green, related objects in blue, and all other objects in red.
  • Figure 4: Method Overview. RefProg is a dual-path method that independently generates 3D perception outputs and Python-based programs for referential grounding. Given raw LiDAR and RGB inputs, RefProg runs an offline 3D perception model to generate high-quality 3D tracks. In parallel, it prompts an LLM to generate code to identify the referred track. Finally, the generated code is executed to filter the output of the offline 3D perception model to produce a final set of referred objects, related objects, and other objects.
  • Figure 5: Manual Annotation Tool. We create an annotation tool to assist with labeling manually defined scenarios. Our tool allows us to quickly annotate multi-object referential tracks in AV2.
  • ...and 3 more figures
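
The captions for Figures 2 and 4 describe composing atomic functions with boolean operators and executing the resulting program to filter 3D tracks. The sketch below illustrates that idea in miniature. Everything in it is an assumption made for illustration: the `Track` container, the `Scenario` type, the two atomic functions (`is_category`, `moving`), the 0.5 m/s threshold, and the combinators (`scenario_and`, `scenario_or`, `scenario_not`) are hypothetical names and are not taken from the RefAV codebase or its set of 28 atomic functions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical representation: track id -> timestamps at which a predicate holds.
Scenario = Dict[str, Set[int]]


@dataclass(frozen=True)
class Track:
    """Illustrative stand-in for one 3D track (not the RefAV data format)."""
    track_id: str
    category: str
    speeds: Dict[int, float]  # timestamp index -> speed in m/s


def is_category(tracks: List[Track], category: str) -> Scenario:
    """Atomic function: tracks of a given class, at every observed timestamp."""
    return {t.track_id: set(t.speeds) for t in tracks if t.category == category}


def moving(tracks: List[Track], min_speed: float = 0.5) -> Scenario:
    """Atomic function: timestamps where a track's speed is at least min_speed."""
    out: Scenario = {}
    for t in tracks:
        stamps = {ts for ts, v in t.speeds.items() if v >= min_speed}
        if stamps:
            out[t.track_id] = stamps
    return out


def scenario_and(a: Scenario, b: Scenario) -> Scenario:
    """Boolean AND: keep timestamps where both predicates hold for a track."""
    out: Scenario = {}
    for tid in a.keys() & b.keys():
        stamps = a[tid] & b[tid]
        if stamps:
            out[tid] = stamps
    return out


def scenario_or(a: Scenario, b: Scenario) -> Scenario:
    """Boolean OR: keep timestamps where either predicate holds for a track."""
    return {tid: a.get(tid, set()) | b.get(tid, set()) for tid in a.keys() | b.keys()}


def scenario_not(tracks: List[Track], a: Scenario) -> Scenario:
    """Boolean NOT: observed timestamps where the predicate does not hold."""
    out: Scenario = {}
    for t in tracks:
        stamps = set(t.speeds) - a.get(t.track_id, set())
        if stamps:
            out[t.track_id] = stamps
    return out


# Toy tracks standing in for the output of a 3D perception model.
tracks = [
    Track("ped_1", "PEDESTRIAN", {0: 0.0, 1: 1.2, 2: 1.3}),
    Track("car_1", "REGULAR_VEHICLE", {0: 5.0, 1: 5.1, 2: 0.0}),
    Track("car_2", "REGULAR_VEHICLE", {0: 0.0, 1: 0.0, 2: 0.0}),
]

# A composed program for the prompt "moving vehicle".
referred = scenario_and(is_category(tracks, "REGULAR_VEHICLE"), moving(tracks))
```

Here `referred` maps each matching track to the timestamps at which the description holds, so a single result carries both the "which object" and the "when" parts of the localization. A track absent from the result is treated as not matching the prompt anywhere in the log.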