Retrieval-Augmented Open-Vocabulary Object Detection

Jooyeon Kim, Eulrang Cho, Sehyung Kim, Hyunwoo J. Kim

TL;DR

Open-vocabulary object detection aims to recognize categories unseen during training by leveraging vision-language models. The authors propose RALF, a retrieval-augmented framework with two modules: Retrieval-Augmented Losses (RAL), which retrieve hard and easy negative vocabularies from a large vocabulary store to shape the embedding space during training, and Retrieval-Augmented visual Features (RAF), which uses LLM-generated verbalized concepts to augment visual features and logits at inference. RALF improves novel-category detection on COCO and LVIS, with gains of up to 3.4 box AP$_{50}^{\text{N}}$ on COCO and 3.6 mask AP$_{\text{r}}$ on LVIS, while preserving base-category performance. The method is designed as a plug-in to existing detectors, relying on a large vocabulary store and LLM-derived concepts to broaden detector knowledge with minimal retraining.

Abstract

Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve generalization by expanding the knowledge of the detector, using 'positive' pseudo-labels with additional 'class' names, e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments loss functions. Also, visual features are augmented with 'verbalized concepts' of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL consists of two losses reflecting the semantic similarity with negative vocabularies. In addition, RAF augments visual features with the verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on the COCO and LVIS benchmark datasets. We achieve improvements of up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of the COCO dataset and 3.6 mask AP$_{\text{r}}$ on the LVIS dataset. Code is available at https://github.com/mlvlab/RALF.
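To make the RAL idea concrete, the following is a minimal sketch of negative retrieval followed by two margin-based triplet losses. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy 3-dimensional embeddings, the margin value, and the specific hinge form of the loss are all hypothetical, and the paper's exact loss formulation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_negatives(label, vocab, k=1):
    """Rank every other vocabulary entry by similarity to `label`.

    Returns (hard, easy): the k most similar and the k least similar words.
    """
    anchor = vocab[label]
    ranked = sorted(
        (w for w in vocab if w != label),
        key=lambda w: cosine(anchor, vocab[w]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]

def triplet_loss(box_emb, pos_emb, neg_emb, margin):
    """Hinge-style triplet loss on cosine similarity (assumed form)."""
    return max(0.0, cosine(box_emb, neg_emb) - cosine(box_emb, pos_emb) + margin)

# Toy text embeddings standing in for a large vocabulary store (made-up values).
vocab = {
    "jaguar": [0.9, 0.4, 0.1],
    "cat": [0.8, 0.5, 0.2],
    "dog": [0.5, 0.7, 0.3],
    "bottle": [0.0, 0.1, 1.0],
}

hard, easy = retrieve_negatives("jaguar", vocab)

# Hypothetical embedding of a ground-truth box labelled 'jaguar'.
box_emb = [0.85, 0.45, 0.1]
loss_hard = triplet_loss(box_emb, vocab["jaguar"], vocab[hard[0]], margin=0.2)
loss_easy = triplet_loss(box_emb, vocab["jaguar"], vocab[easy[0]], margin=0.2)
```

With these toy values, 'cat' is retrieved as the hard negative and 'bottle' as the easy negative for 'jaguar'. The hard negative still violates the margin and produces a positive loss, while the easy negative is already well separated and contributes zero.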

Paper Structure (20 sections, 15 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Negative vocabularies and verbalized concepts from a large vocabulary set. (a) Example of negative vocabularies that can be derived from a large vocabulary set. Given the category 'jaguar', 'cat' and 'bottle' can be retrieved from the vocabulary set as hard negative (similar) and easy negative (dissimilar) vocabulary, respectively. (b) Example of verbalized concepts generated by LLMs. The concepts provide more detailed information about the object, such as its attributes.
  • Figure 2: Overall pipeline of RALF. (a) The first module, RAL, is used during detector training. Given a ground-truth box $b$, the ground-truth box embedding $\boldsymbol{e}_b$ is extracted and used to define $\mathcal{L}^\text{RAL}$, which is augmented with hard and easy negative vocabulary. The augmented loss $\mathcal{L}^\text{RAL}$ and the baseline loss $\mathcal{L}^\text{baseline}$ are employed together to train the detector. The other branches (e.g., box regression, distillation, and mask prediction) are omitted from the illustration of both the training and inference pipelines. (b) The second module, RAF, augments visual features with verbalized concepts and is pre-trained before being used in the inference pipeline. Augmented visual features $\boldsymbol{v}_r^\text{aug}$ are created by the concept retriever and the augmenter from visual features $\boldsymbol{v}_r$, which are generated offline from object proposals. RAF is trained with two losses ($\mathcal{L}^\text{cls}$ and $\mathcal{L}^\text{reg}$), using $\boldsymbol{v}_r$, $\boldsymbol{v}_r^\text{aug}$, and $\tilde{y}_r$, the pseudo-label of the visual feature. (c) At detector inference time, the trained RAF is used. Classification logits $\boldsymbol{l}_r$ trained by RAL and auxiliary logits $\boldsymbol{l}_r^\text{aux}$ influenced by RAF are computed with text embeddings of test categories. Then, the final logits $\boldsymbol{l}_r^\text{final}$ are determined through an ensemble of $\boldsymbol{l}_r$ and $\boldsymbol{l}_r^\text{aux}$.
  • Figure 3: RAL. Given the ground-truth class label $y_b$, the negative retriever extracts hard negative vocabulary $V_{y_b}^\text{hard}$ and easy negative vocabulary $V_{y_b}^\text{easy}$ based on semantic similarity with $\mathcal{T}(y_b)$. To enhance the generalizability of the detector, two triplet losses (i.e., the hard negative loss $\mathcal{L}^\text{hard}$ and the easy negative loss $\mathcal{L}^\text{easy}$) are augmented with $V_{y_b}^\text{hard}$, $V_{y_b}^\text{easy}$, and the ground-truth box embedding $\boldsymbol{e}_b$.
  • Figure 4: RAF. Verbalized concepts are generated by an LLM with prompts and stored in the concept store. Given a visual feature $\boldsymbol{v}_r$, relevant concept embeddings $H_r$ and scores $\boldsymbol{s}_r$ are retrieved by the concept retriever. Then the augmenter $\mathcal{A}$ creates the augmented visual feature $\boldsymbol{v}_r^\text{aug}$ from the related verbalized concepts.
  • Figure 5: Qualitative results on COCO. Results of bounding box predictions on novel categories for (a) OADP and (b) OADP+RALF.
  • ...and 1 more figures
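
The RAF inference path described in the captions of Figures 2(c) and 4 (retrieve concepts for $\boldsymbol{v}_r$, build $\boldsymbol{v}_r^\text{aug}$, then ensemble $\boldsymbol{l}_r$ and $\boldsymbol{l}_r^\text{aux}$) can be sketched roughly as follows. This is only an illustration under strong assumptions: the captions state that the augmenter $\mathcal{A}$ is trained with $\mathcal{L}^\text{cls}$ and $\mathcal{L}^\text{reg}$, whereas the sketch substitutes a fixed softmax-weighted residual for it. The function names, the toy embeddings, the dot-product logits, and the ensemble weight `alpha` are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve_concepts(v_r, concept_store, k):
    """Return the names, embeddings H_r, and scores s_r of the top-k concepts."""
    names = sorted(concept_store, key=lambda n: dot(v_r, concept_store[n]), reverse=True)[:k]
    H_r = [concept_store[n] for n in names]
    s_r = [dot(v_r, h) for h in H_r]
    return names, H_r, s_r

def augment(v_r, H_r, s_r):
    """Stand-in for the trained augmenter: add a score-weighted concept mix to v_r."""
    weights = softmax(s_r)
    mix = [sum(w * h[i] for w, h in zip(weights, H_r)) for i in range(len(v_r))]
    return normalize([a + b for a, b in zip(v_r, mix)])

def ensemble(l_r, l_aux, alpha):
    """Weighted average of detector logits and auxiliary logits (assumed form)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(l_r, l_aux)]

# Toy concept store and test-category text embeddings (made-up values).
concept_store = {
    "worn on the feet": [1.0, 0.0, 0.0],
    "sharp teeth": [0.0, 1.0, 0.0],
    "handheld music player": [0.0, 0.0, 1.0],
}
categories = ["sock", "alligator", "iPod"]
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

v_r = normalize([0.6, 0.5, 0.2])                  # visual feature of one proposal
names, H_r, s_r = retrieve_concepts(v_r, concept_store, k=2)
v_aug = augment(v_r, H_r, s_r)

l_r = [dot(v_r, t) for t in text_emb]             # stand-in for detector logits
l_aux = [dot(v_aug, t) for t in text_emb]         # auxiliary logits from v_aug
l_final = ensemble(l_r, l_aux, alpha=0.5)
predicted = categories[l_final.index(max(l_final))]
```

In this toy setup the two retrieved concepts pull the feature toward the categories they describe, so the auxiliary logit for the category with no retrieved concept drops relative to the original logit.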