Retrieval-Augmented Open-Vocabulary Object Detection
Jooyeon Kim, Eulrang Cho, Sehyung Kim, Hyunwoo J. Kim
TL;DR
Open-vocabulary object detection seeks to recognize categories unseen during training by leveraging vision-language models. The authors propose RALF, a retrieval-augmented framework with two modules: Retrieval-Augmented Losses (RAL) that pull hard and easy negative vocabularies from a large store to shape the embedding space, and Retrieval-Augmented visual Features (RAF) that uses LLM-generated verbalized concepts to augment visual features and logits. Through training-time losses and inference-time feature augmentation, RALF improves novel-category detection on COCO and LVIS, achieving notable gains in metrics such as AP$_{50}^{\text{N}}$ on COCO and AP$_r$ on LVIS, while preserving base-category performance. The method is designed as a plug-in to existing detectors, relying on a large vocabulary store and LLM-derived concepts to broaden detector knowledge with minimal retraining.
Abstract
Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve the generalization ability to expand the knowledge of the detector, using 'positive' pseudo-labels with additional 'class' names, e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments loss functions. Also, visual features are augmented with 'verbalized concepts' of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL constitutes two losses reflecting the semantic similarity with negative vocabularies. In addition, RAF augments visual features with the verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on COCO and LVIS benchmark datasets. We achieve improvement up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of the COCO dataset and 3.6 mask AP$_{\text{r}}$ gains on the LVIS dataset. Code is available at https://github.com/mlvlab/RALF .
