RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection

Fangyi Chen, Han Zhang, Zhantao Yang, Hao Chen, Kai Hu, Marios Savvides

TL;DR

RTGen generates scalable open-vocabulary region-text pairs from image-caption data. Training on these pairs boosts open-vocabulary object detection and delivers superior performance compared to existing state-of-the-art methods.

Abstract

Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which can be learned from massive region-text pairs. However, such data is limited in practice due to significant annotation costs. In this work, we propose RTGen to generate scalable open-vocabulary region-text pairs and demonstrate its capability to boost the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. The text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform region-level image captioning multiple times with various prompts and select the best-matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also introduce a localization-aware region-text contrastive loss that learns object proposals tailored to different localization qualities. Extensive experiments demonstrate that RTGen can serve as a scalable, semantically rich, and effective data source for open-vocabulary object detection and that it continues to improve model performance as more data is used, delivering superior performance compared to existing state-of-the-art methods.
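
The region-to-text selection step described in the abstract can be illustrated with a minimal sketch. It assumes the region and each candidate caption have already been embedded (for example by a CLIP image and text encoder) and picks the caption whose embedding has the highest cosine similarity to the region embedding. The function names `cosine_similarity` and `select_best_caption`, and the toy three-dimensional vectors, are hypothetical and are not taken from the paper; the authors' actual pipeline may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    region_embedding: Sequence[float],
    captions: List[str],
    caption_embeddings: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate caption most similar to the region, with its score.

    Hypothetical helper: the embeddings are assumed to come from a shared
    image-text embedding space such as CLIP's.
    """
    if not captions or len(captions) != len(caption_embeddings):
        raise ValueError("captions and caption_embeddings must be non-empty and aligned")
    scores = [cosine_similarity(region_embedding, emb) for emb in caption_embeddings]
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores[best]


# Toy embeddings standing in for real encoder outputs (illustrative values only).
region = [1.0, 0.0, 0.0]
candidates = ["a glass of beer", "a man", "a table"]
embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]

best_caption, best_score = select_best_caption(region, candidates, embeddings)
print(best_caption, round(best_score, 3))
```

In practice the candidate captions would come from running a region-level captioner several times with different prompts, and the similarity scores could also be used to filter out poorly aligned pairs.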

Paper Structure

This paper contains 27 sections, 7 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: (a) Image-caption pairs may lack visual-textual correlation, which is addressed through the text-to-region (T2R) and region-to-text (R2T) processes. (b) The generative processes introduce a breadth of visual and textual diversity. (c) Average precision on OV-COCO novel classes when training with different percentages of our generated data. Performance rises steeply when R2T+T2R is applied, and the improvement grows with the volume of data used.
  • Figure 2: Framework overview. We generate region-text pairs from image-caption pairs through the T2R and R2T processes. Detectors are jointly trained via the localization-aware region-text contrastive loss.
  • Figure 3: Scene-aware allocation for inpainting is an underexplored but essential task in our generation framework. We propose SAIG to allocate the layout with awareness of the scene.
  • Figure 4: (a) SAIG allocates the phrases "Man" and "a glass of beer" to more suitable boxes according to the scene. (b) SAIG preserves more semantic information during allocation than grounding.
  • Figure 5: Alignment quality of pairs. $\textcolor{red}{\times}$ indicates the lowest $3\%$.
  • ...and 4 more figures