AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving

Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker

TL;DR

The paper tackles the open-world object detection challenge in autonomous driving by presenting AIDE, an Automatic Data Engine that uses vision-language models and large language models to automatically identify issues, curate data, auto-label, and verify detections. The system comprises Issue Finder, Data Feeder, Model Updater, and Verification, forming an iterative loop that expands the label space with novel categories while mitigating forgetting of known ones. Key contributions include a two-stage pseudo-labeling framework with OWL-v2 box proposals and CLIP-based filtering, text-guided data retrieval to reduce labeling costs, and LLM-assisted verification to generate diverse scenarios for robust testing. Experiments on AV datasets demonstrate gains over baseline open-vocabulary detectors and substantial cost reductions, illustrating the practical potential of automated data engines for scalable AV perception, albeit with caveats about potential hallucinations from VLMs/LLMs and the need for human oversight in safety-critical contexts.
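The two-stage pseudo-labeling idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Proposal` type, the `pseudo_label` function, the score threshold, the "keep a box only if the novel class is the best text match" rule, and the toy 2-D embeddings are all assumptions made for illustration. In a real pipeline the boxes would come from a zero-shot detector such as OWL-v2, and the embeddings from CLIP's image and text encoders.

```python
import math
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical container for one stage-1 box proposal."""
    box: tuple             # (x1, y1, x2, y2)
    score: float           # detector confidence from the zero-shot stage
    crop_embedding: list   # stand-in for an image embedding of the cropped box


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pseudo_label(proposals, text_embeddings, novel_class, min_score=0.3):
    """Stage 1: drop low-confidence boxes. Stage 2: keep a box only if the
    novel class is the most similar class name (assumed filtering rule)."""
    kept = []
    for p in proposals:
        if p.score < min_score:
            continue
        sims = {name: cosine(p.crop_embedding, emb)
                for name, emb in text_embeddings.items()}
        if max(sims, key=sims.get) == novel_class:
            kept.append((p.box, novel_class))
    return kept


# Toy data: two class-name embeddings and three proposals (all illustrative).
text_embeddings = {"car": [1.0, 0.0], "traffic cone": [0.0, 1.0]}
proposals = [
    Proposal((0, 0, 10, 10), 0.9, [0.1, 0.9]),   # confident, looks like a cone
    Proposal((5, 5, 20, 20), 0.8, [0.9, 0.2]),   # confident, looks like a car
    Proposal((1, 1, 4, 4), 0.1, [0.0, 1.0]),     # cone-like but low confidence
]
labels = pseudo_label(proposals, text_embeddings, "traffic cone")
print(labels)
```

Only the first proposal survives both stages here: the second is rejected by the similarity check and the third by the confidence threshold. Comparing against known class names as well as the novel one is one plausible way to keep boxes of known categories from being relabeled as novel.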

Abstract

Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.

Paper Structure (35 sections, 13 figures, 11 tables)

Figures (13)

  • Figure 1: Top: Components of DevOps systems for autonomous driving. Bottom: With our automatic data system, we can achieve similar performance with lower labeling and training costs.
  • Figure 2: Our design of the automatic data engine includes Issue Finder, Data Feeder, Model Updater, and Verification. The Issue Finder automatically identifies novel categories using the dense captioning model. In the Data Feeder, we employ VLMs to efficiently search for relevant data for training, significantly reducing the inference time for generating pseudo-labels in the subsequent steps and filtering out unrelated images for training. The model is updated in the Model Updater using auto-labeling by VLMs, enabling the recognition of novel categories without incurring any labeling costs. To verify the model, in Verification, we use LLMs to generate descriptions of variations in scenarios and then assess predictions on images queried by VLMs.
  • Figure 3: Examples of the Issue Finder. We use Otter (Li et al., 2023) to generate detailed descriptions of an image, then identify the novel category that is missing from the label space (shown in red).
  • Figure 4: Visualization of the queried images from Data Feeder on three novel categories.
  • Figure 5: Our two-stage pseudo-labeling for Model Updater: generate boxes by zero-shot detection and label by CLIP filtering.
  • ...and 8 more figures
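
The Data Feeder step described in the Figure 2 caption (searching a pool of images for those relevant to a novel category before pseudo-labeling) can be sketched as a text-to-image similarity ranking. This is a minimal sketch under stated assumptions, not the paper's method: the `retrieve` function, the cosine-similarity ranking, the `top_k` cutoff, the image IDs, and the toy 2-D embeddings are hypothetical, and a real system would obtain the embeddings from a vision-language model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding, image_embeddings, top_k=2):
    """Rank images by similarity to the text query and return the top_k IDs."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:top_k]]


# Toy pool of image embeddings and one text-query embedding (illustrative).
image_embeddings = {
    "img_001": [0.9, 0.1],
    "img_002": [0.2, 0.8],
    "img_003": [0.1, 0.95],
    "img_004": [0.7, 0.7],
}
query_embedding = [0.0, 1.0]  # stand-in for the embedding of a category name

selected = retrieve(query_embedding, image_embeddings, top_k=2)
print(selected)
```

Restricting pseudo-labeling to the retrieved subset is what would cut inference time in the later steps, since the detector and the filter run on `top_k` images rather than on the whole pool.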