AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving
Mingfu Liang, Jong-Chyi Su, Samuel Schulter, Sparsh Garg, Shiyu Zhao, Ying Wu, Manmohan Chandraker
TL;DR
The paper tackles open-world object detection in autonomous driving with AIDE, an Automatic Data Engine that uses vision-language models (VLMs) and large language models (LLMs) to automatically identify issues, curate data, auto-label, and verify the model. The system comprises four components, Issue Finder, Data Feeder, Model Updater, and Verification, which form an iterative loop that expands the label space with novel categories while mitigating forgetting of known ones. Key contributions include a two-stage pseudo-labeling framework with OWL-v2 box proposals and CLIP-based filtering, text-guided data retrieval to reduce labeling costs, and LLM-assisted verification that generates diverse scenarios for robust testing. Experiments on AV datasets show gains over baseline open-vocabulary detectors and substantial cost reductions, illustrating the practical potential of automated data engines for scalable AV perception. Two caveats remain: VLMs and LLMs may hallucinate, and safety-critical contexts still need human oversight.
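The two-stage pseudo-labeling idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: toy vectors stand in for the outputs of a box proposer such as OWL-v2 and an image-text model such as CLIP. The names `Proposal` and `pseudo_label`, the thresholds `box_thresh` and `sim_thresh`, and the "best-matching prompt must be the novel class" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Proposal:
    """Hypothetical stage-1 output: a box, its score, and a crop embedding."""
    box: Box
    objectness: float
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pseudo_label(
    proposals: List[Proposal],
    text_embeddings: Dict[str, List[float]],
    novel_class: str,
    box_thresh: float = 0.5,  # assumed value, not from the paper
    sim_thresh: float = 0.3,  # assumed value, not from the paper
) -> List[Tuple[Box, str, float]]:
    """Stage 1 keeps confident boxes; stage 2 keeps only boxes whose crop
    embedding matches the novel class better than any other class prompt."""
    kept = []
    for p in proposals:
        if p.objectness < box_thresh:
            continue  # stage 1: drop low-confidence box proposals
        sims = {name: cosine(p.embedding, emb)
                for name, emb in text_embeddings.items()}
        best = max(sims, key=sims.get)
        if best == novel_class and sims[best] >= sim_thresh:
            kept.append((p.box, novel_class, sims[best]))  # stage 2 passed
    return kept


# Toy 2-D embeddings standing in for real text and image-crop features.
text_embeddings = {"stroller": [1.0, 0.0], "car": [0.0, 1.0]}
proposals = [
    Proposal((0.0, 0.0, 10.0, 10.0), 0.9, [0.9, 0.1]),   # confident, stroller-like
    Proposal((5.0, 5.0, 20.0, 20.0), 0.2, [1.0, 0.0]),   # fails stage 1
    Proposal((30.0, 30.0, 60.0, 60.0), 0.8, [0.1, 0.9]), # car-like, fails stage 2
]
labels = pseudo_label(proposals, text_embeddings, novel_class="stroller")
print(labels)
```

In this toy run only the first proposal survives both stages. Scoring each crop against the known-class prompts as well as the novel one is one plausible way to suppress boxes that actually belong to an already-known category.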
Abstract
Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
