Open-Vocabulary Object Detection via Neighboring Region Attention Alignment

Sunyuan Qiang, Xianfei Li, Yanyan Liang, Wenlong Liao, Tao He, Pai Peng

TL;DR

This work tackles open-vocabulary object detection by addressing the lack of inter-region relational information when distilling from vision-language models (VLMs). It introduces Neighboring Region Attention Alignment (NRAA), which samples neighboring regions around each proposal and applies attention over the region tokens to produce relational features; these features are aligned with VLM embeddings via an InfoNCE loss. Empirical results on OV-COCO and OV-LVIS show substantial gains over prior distillation-based methods while maintaining strong base-class performance, indicating that relational context matters for cross-modal alignment. The method is an end-to-end approach that improves open-vocabulary inference in two-stage detectors without increasing inference cost.

Abstract

The diversity of real-world environments requires neural network models to move beyond closed category settings and accommodate newly emerging categories. In this paper, we study open-vocabulary object detection (OVD), which enables the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the lack of neighboring relationships between regions during the alignment process inevitably constrains the performance of recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism of a set of neighboring regions to boost open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and apply our proposed neighboring region attention (NRA) mechanism to extract relationship information. This interaction information is then fed into the distillation procedure to assist the alignment between the detector and the pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
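
The alignment idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the mean pooling of attended tokens, the temperature value, and all function and variable names (`region_self_attention`, `info_nce`, `w_q`, and so on) are assumptions made for illustration. Random arrays stand in for detector region features and VLM embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def region_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over region tokens (assumed form).

    tokens: (B, N, D) array, B proposals with N region tokens each
            (for example, the proposal plus its sampled neighbors).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])  # (B, N, N)
    return softmax(scores, axis=-1) @ v                          # (B, N, D)

def info_nce(student, teacher, temperature=0.1):
    """InfoNCE over a batch: row i of `student` is the positive for row i
    of `teacher`; all other rows act as negatives."""
    s, t = l2_normalize(student), l2_normalize(teacher)
    logits = s @ t.T / temperature                               # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
B, N, D = 4, 9, 16                       # hypothetical sizes
tokens = rng.normal(size=(B, N, D))      # stand-in for detector region features
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

attended = region_self_attention(tokens, w_q, w_k, w_v)
student = attended.mean(axis=1)          # pooling choice is an assumption
teacher = rng.normal(size=(B, D))        # stand-in for VLM embeddings
loss = info_nce(student, teacher)
```

In this sketch the loss falls as each pooled relational feature moves closer to its own teacher embedding than to the embeddings of the other proposals in the batch.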

Paper Structure

This paper contains 13 sections, 14 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Comparison between recent distillation (alignment) based OVD methods and our proposed Neighboring Region Attention Alignment (NRAA). IE denotes the pretrained vision-language image encoder from CLIP, and RPN denotes the region proposal network. (G.) and (B.) denote the global pooling and block pooling operators, respectively. The text encoder used in BARON and our method is omitted for better visualization. (a) ViLD [DBLP:conf/iclr/GuLKC22]. (b) OADP [DBLP:journals/corr/abs-2303-05892]. (c) BARON [wu2023baron]. (d) NRAA (Ours).
  • Figure 2: (a) From top to bottom, increasing the content given to the VLMs (e.g., expanded inputs) extracts more knowledge from them, and this knowledge should be consistent with the yet-to-be-distilled information from the detector during the alignment process. (b) Performance comparison on the $\text{AP}_{50}^\text{novel}$ metric on the OV-COCO benchmark.
  • Figure 3: Overview of the architecture of our proposed NRAA model. (a) Testing stage: NRAA is built upon Faster R-CNN and performs OVD classification within the multi-modal text representation space. (b) Training stage: the NRAA model introduces an attention mechanism to facilitate interaction among a set of region features for alignment. The basic detection losses are omitted for better visualization. (c) Our neighboring region attention (NRA) module.
  • Figure 4: Visualization of neighboring regions. Green marks the original image $\mathbf{x}$, blue marks the proposal region $r$, violet marks the neighboring areas labeled with index numbers (1-8), $\{\bar{r}_i\}_{i=1}^8$, and the brown dashed lines indicate the outermost expanded region $r_\text{outer}$ fed to the image encoder. Best viewed in color. (a) Neighboring regions of a region proposal. (b) (c) Two sampling examples.
  • Figure 5: The ablation architectures of the NRA module, serving as an explanatory note for the configuration settings in Table \ref{table_appendix_ablation_study_nra_pos}. The configuration of the NRA module is split into training and testing phases. During the testing stage, only classification needs to be considered, leading to two scenarios: using the NRA module (a) and not using it (b). During training, both classification and alignment need to be considered simultaneously. Consequently, there are three settings for integrating the NRA module: adding it solely to the classification (c), adding it to both positions (d), and adding it solely to the alignment (e).
  • ...and 3 more figures
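
The neighboring-region layout described in the Figure 4 caption (a proposal $r$, eight surrounding areas, and an outer region $r_\text{outer}$) can be sketched as follows. This is an illustrative guess at the geometry, not the paper's sampling procedure: it assumes the eight neighbors are same-size boxes on a 3x3 grid centered on the proposal, that neighbors are clipped to the image, and that $r_\text{outer}$ is the bounding box of the proposal and the sampled neighbors. The row-major ordering and all function names are hypothetical.

```python
import random

def neighbor_boxes(box):
    """Return the 8 same-size boxes around box = (x1, y1, x2, y2),
    in row-major order (this ordering is an assumption)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    out = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue  # skip the proposal itself
            out.append((x1 + dx * w, y1 + dy * h, x2 + dx * w, y2 + dy * h))
    return out

def clip_box(box, width, height):
    """Clip a box to the image bounds."""
    x1, y1, x2, y2 = box
    return (max(0, x1), max(0, y1), min(width, x2), min(height, y2))

def sample_neighbors(box, width, height, k, rng):
    """Randomly pick up to k neighbors that still have positive area
    after clipping to the image."""
    clipped = [clip_box(b, width, height) for b in neighbor_boxes(box)]
    valid = [b for b in clipped if b[2] > b[0] and b[3] > b[1]]
    return rng.sample(valid, min(k, len(valid)))

def outer_region(boxes):
    """Smallest box enclosing all given boxes (assumed form of r_outer)."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

proposal = (40, 40, 60, 60)              # hypothetical proposal in a 100x100 image
rng = random.Random(0)
sampled = sample_neighbors(proposal, 100, 100, k=4, rng=rng)
r_outer = outer_region([proposal] + sampled)
```

Under these assumptions, a proposal in the image corner keeps only the neighbors that fall inside the image, so the number of sampled regions can be smaller than `k`.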