SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection

Zishuo Wang, Wenhao Zhou, Jinglin Xu, Yuxin Peng

TL;DR

This work tackles the gap between image-level CLIP representations and region-level features in open-vocabulary detection, attributing it to the shape deformation introduced by RoI extraction. It introduces the Shape-Invariant Adapter (SIA), a set of shape-aware bottleneck adapters whose outputs are selectively combined via an Adapter Allocation Mechanism based on region aspect ratio, yielding shape-invariant region embeddings aligned with CLIP text features. By keeping the CLIP image encoder frozen and training in two stages, SIA improves region classification and novel-category detection on the COCO-OVD and OV-LVIS benchmarks, surpassing representative baselines. The approach offers a practical path to robust open-vocabulary detection that needs little fine-tuning and generalizes well to unseen categories.

Abstract

Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations, achieving open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLMs) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in a gap between image and region representations. Directly using CLIP for OVD therefore causes inaccurate region classification. We find that the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions with different shapes and designs a new adapter allocation mechanism to select the optimal adapter for each region. The adapted region representations align better with the text representations learned by CLIP. Extensive experiments demonstrate that SIA-OVD effectively improves the classification accuracy for regions by addressing the gap between images and regions caused by shape deformation. SIA-OVD achieves substantial improvements over representative methods on the COCO-OVD benchmark. The code is available at https://github.com/PKU-ICST-MIPL/SIA-OVD_ACMMM2024.
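
To make the idea of shape-specific adapters concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that regions are bucketed by the log of their width/height ratio, that each adapter is a residual two-layer bottleneck with a zero-initialized output projection, and that classification is a softmax over cosine similarities with text embeddings. The names `aspect_ratio_bucket`, `BottleneckAdapter`, `classify_region`, the bucket edges, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np


def aspect_ratio_bucket(box, edges):
    """Return the index of the aspect-ratio interval containing the box.

    box is (x1, y1, x2, y2); edges are sorted log(width/height) thresholds.
    This bucketing rule is an assumption, not the paper's exact mechanism.
    """
    x1, y1, x2, y2 = box
    ratio = (x2 - x1) / (y2 - y1)
    return int(np.searchsorted(edges, np.log(ratio)))


class BottleneckAdapter:
    """Residual bottleneck: x + relu(x @ down) @ up (assumed form)."""

    def __init__(self, dim, hidden, rng):
        self.down = rng.normal(0.0, 0.02, (dim, hidden))
        # Zero-initialized so an untrained adapter leaves the feature unchanged.
        self.up = np.zeros((hidden, dim))

    def __call__(self, x):
        return x + np.maximum(x @ self.down, 0.0) @ self.up


def classify_region(feat, box, adapters, edges, text_emb, temperature=0.01):
    """Adapt a region feature with its shape-specific adapter, then score it
    against the class text embeddings by cosine similarity."""
    adapter = adapters[aspect_ratio_bucket(box, edges)]
    z = adapter(feat)
    z = z / np.linalg.norm(z)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (t @ z) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_classes = 16, 5
    # Three buckets: tall (ratio < 0.5), roughly square, wide (ratio > 2).
    edges = np.log([0.5, 2.0])
    adapters = [BottleneckAdapter(dim, 4, rng) for _ in range(len(edges) + 1)]
    region_feat = rng.normal(size=dim)        # stand-in for a pooled RoI feature
    text_emb = rng.normal(size=(num_classes, dim))  # stand-in for text features
    wide_box = (0.0, 0.0, 40.0, 10.0)
    print(aspect_ratio_bucket(wide_box, edges))
    print(classify_region(region_feat, wide_box, adapters, edges, text_emb))
```

In this sketch only the adapter weights would be trained, which mirrors the frozen-encoder setting described above; the random features stand in for real CLIP outputs.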

Paper Structure (14 sections, 10 equations, 7 figures, 3 tables)

This paper contains 14 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison between Closed-Set Object Detection and Open-Vocabulary Object Detection. The closed-set detector learns object knowledge from instance-level supervision, where RoIAlign deforms the object regions. The OVD detector learns from image-level supervision, where object regions keep their original shape. This difference causes the gap between image and region representations in OVD, especially for deformed object regions.
  • Figure 2: Overview of the SIA-OVD framework. It takes an image and prompt templates filled in with class names as input, and outputs the bounding boxes of the objects in the whole image along with their predicted classes.
  • Figure 3: Illustration of Shape-Invariant Adapter.
  • Figure 4: Classification accuracy of CLIP, CORA, and our SIA (RN50 backbone) for regions with different shapes on the COCO-OVD validation set, using ground-truth bounding boxes.
  • Figure 5: Effect of the number of adapters on detection performance (AP50) for all and rare categories in LVIS.
  • ...and 2 more figures