More Pictures Say More: Visual Intersection Network for Open Set Object Detection

Bingcheng Dong, Yuning Ding, Jinrong Zhang, Sifan Zhang, Shenglan Liu

TL;DR

This work addresses open-set object detection by proposing Visual Intersection Network for Open Set Object Detection (VINO), a DETR-based framework that maintains semantic intersections across time with a multi-image visual bank. A novel prompt-update mechanism preserves representative semantics while allowing flexible inclusion of new information, enabling strong alignment between region and category semantics with reduced pre-training demands. The approach supports both detection and segmentation via a shared semantic-intersection representation, achieving competitive results on LVIS and ODinW benchmarks and demonstrating robust zero-shot generalization. By leveraging multiple visual prompts rather than textual semantics, VINO offers a scalable, efficient path for open-set detection and broader visual tasks.

Abstract

Open Set Object Detection has seen rapid development recently, but it continues to pose significant challenges. Language-based methods, grappling with the substantial modal disparity between textual and visual modalities, require extensive computational resources to bridge this gap. Although integrating visual prompts into these frameworks shows promise for enhancing performance, it always comes with constraints related to textual semantics. In contrast, visual-only methods suffer from the low-quality fusion of multiple visual prompts. In response, we introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO), which constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps. Our innovative multi-image visual updating mechanism learns to identify the semantic intersections from various visual prompts, enabling the flexible incorporation of new information and continuous optimization of feature representations. Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands compared to language-based methods. Furthermore, the integration of a segmentation head illustrates the broad applicability of visual intersection in various visual tasks. VINO, which requires only 7 RTX4090 GPU days to complete one epoch on the Objects365v1 dataset, achieves competitive performance on par with vision-language models on benchmarks such as LVIS and ODinW35.

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Comparison of various object detection models under visual and textual prompts. The figure highlights the challenges faced by existing models such as language-based vision models, siamese networks, optimized visual prompts, and interactive visual prompts, including issues with textual ambiguity, redundant information, semantic overlap, and fine-grained comprehension. In contrast, Visual Intersection Network (VINO) effectively addresses these challenges by leveraging the semantic intersection of multi-image visual prompts, enhancing detection accuracy and generalization in open set environments.
  • Figure 2: The model architecture of VINO with multi-image visual bank. The VINO model architecture incorporates a visual prompt encoder that extracts features from cropped images in $T_{t-1}$ as visual prompts and stores them in the multi-image visual bank. When a new target image is processed, the model uses labels to get visual prompts and the prompt encoder to extract relevant features. Through cosine similarity-based selection and feature updating, the multi-image visual bank maintains and refines semantic intersections across categories, thereby improving the detection and alignment of objects in the target image.
  • Figure 3: The Visualization of VINO-D.
  • Figure 4: The Visualization of VINO-S.
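
The Figure 2 caption mentions cosine similarity-based selection and feature updating for the multi-image visual bank, but this summary does not give the exact rule. The following is only a minimal sketch of one plausible reading, not the authors' implementation: a per-category bank entry is updated with a new prompt feature only when the two are sufficiently similar. The names `cosine` and `update_bank`, the `threshold` gate, and the momentum (moving-average) blend are all assumptions introduced here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def update_bank(bank, label, prompt, threshold=0.5, momentum=0.9):
    """Hypothetical bank update (assumed rule, not taken from the paper).

    bank:      dict mapping a category label to its stored feature vector
    prompt:    new visual prompt feature for that category
    threshold: minimum cosine similarity required to accept the prompt
    momentum:  weight kept on the stored feature when blending
    Returns True if the bank entry was created or updated.
    """
    if label not in bank:
        bank[label] = list(prompt)  # first prompt seen for this category
        return True
    stored = bank[label]
    if cosine(stored, prompt) < threshold:
        return False  # prompt shares too little with the stored semantics
    # Blend the accepted prompt into the stored feature.
    bank[label] = [momentum * s + (1.0 - momentum) * p
                   for s, p in zip(stored, prompt)]
    return True


bank = {}
update_bank(bank, "cat", [1.0, 0.0])                # creates the entry
update_bank(bank, "cat", [0.0, 1.0])                # rejected: orthogonal
update_bank(bank, "cat", [1.0, 1.0], momentum=0.5)  # accepted and blended
```

In this sketch the gate is what keeps an entry close to the features its accepted prompts have in common, while the blend lets new information shift it gradually. The model described in the paper learns its update, so a fixed threshold and momentum are stand-ins only.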