VL-SAM-V2: Open-World Object Detection with General and Specific Query Fusion

Zhiwei Lin, Yongtao Wang

TL;DR

VL-SAM-V2 tackles open-world object detection by fusing general queries from open-ended vision-language models with specific queries from open-set detectors. It introduces ranked learnable queries and a denoising point training strategy alongside a general and specific query fusion module, and it can be evaluated in either open-set or open-ended mode. On LVIS, it achieves state-of-the-art results in both the open-set and open-ended settings, with pronounced gains on rare object categories, and it generalizes well across backbones and vision-language models. The framework can also be combined with SAM for open-world segmentation, which suggests practical use for labeling and object discovery in real-world imagery.

Abstract

Current perception models have achieved remarkable success by leveraging large-scale labeled datasets, but still face challenges in open-world environments with novel objects. To address this limitation, researchers introduce open-set perception models to detect or segment arbitrary test-time user-input categories. However, open-set models rely on human involvement to provide predefined object categories as input during inference. More recently, researchers have framed a more realistic and challenging task known as open-ended perception that aims to discover unseen objects without requiring any category-level input from humans at inference time. Nevertheless, open-ended models suffer from low performance compared to open-set models. In this paper, we present VL-SAM-V2, an open-world object detection framework that is capable of discovering unseen objects while achieving favorable performance. To achieve this, we combine queries from open-set and open-ended models and propose a general and specific query fusion module to allow different queries to interact. By adjusting queries from open-set models, we enable VL-SAM-V2 to be evaluated in the open-set or open-ended mode. In addition, to learn more diverse queries, we introduce ranked learnable queries to match queries with proposals from open-ended models by sorting. Moreover, we design a denoising point training strategy to facilitate the training process. Experimental results on LVIS show that our method surpasses the previous open-set and open-ended methods, especially on rare objects.
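
The abstract says only that ranked learnable queries are matched to open-ended proposals "by sorting". A minimal sketch of that idea follows. It assumes each proposal carries a confidence score to sort on, which the abstract does not specify. The names `match_by_rank`, `score`, and `point` are hypothetical and are not taken from the paper.

```python
def match_by_rank(proposals, num_queries):
    """Pair the k-th learnable query with the k-th highest-scoring proposal.

    Assumption: proposals are dicts with a "score" key; the real sorting
    criterion used by VL-SAM-V2 is not stated in the abstract.
    """
    ranked = sorted(proposals, key=lambda p: p["score"], reverse=True)
    # Query index k is tied to rank k, so each query specializes on one rank.
    return list(enumerate(ranked[:num_queries]))


# Hypothetical open-ended proposals: a point prompt plus a confidence score.
proposals = [
    {"point": (40, 60), "score": 0.35},
    {"point": (10, 20), "score": 0.90},
    {"point": (75, 15), "score": 0.60},
]

pairs = match_by_rank(proposals, num_queries=2)
for query_index, proposal in pairs:
    print(query_index, proposal["point"], proposal["score"])
```

Because the pairing depends only on rank, it needs no learned assignment step. Whether the paper uses confidence or another ordering is left open here.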

Paper Structure

This paper contains 23 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustration of VL-SAM-V2. VL-SAM-V2 combines the general queries from VL-SAM and the specific queries of an open-set model with a query fusion module.
  • Figure 2: The overall pipeline of VL-SAM-V2. VL-SAM-V2 uses a vision-language model to generate general queries and a standard open-set detection model to generate specific queries. The two kinds of queries are then sent to the general and specific query fusion module, where they interact. Finally, a box head and an optional SAM predict the perception results. During training, only the general and specific query fusion module and the box head are fine-tuned. In addition, by controlling the predefined object category list, VL-SAM-V2 can operate in open-ended mode.
  • Figure 3: Illustration of the general and specific query fusion module. General and specific queries interact through a self-attention mechanism. Then, shared query-to-text and query-to-image cross-attention layers are applied to the two kinds of queries independently. Finally, unshared box heads predict the offsets of the corresponding bounding boxes. During training, only the parameters of the self-attention layer and the box heads are updated.
  • Figure 4: Visualization results of VL-SAM-V2 combined with SAM on CODA [li2022coda]. We show input images together with the detection and segmentation predictions in open-ended mode. VL-SAM-V2 can discover various uncommon objects. Best viewed by zooming in.
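
The data flow described in the Figure 3 caption can be sketched in plain Python. This is a rough illustration under stated assumptions, not the authors' implementation. It uses single-head attention with no learned projections, adds residual connections as an assumption, and replaces the box heads with toy linear maps. All function and variable names are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention plus a residual connection."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def box_head(queries, weight):
    """Toy linear head: one 4-d box offset per query (assumed output format)."""
    return [[sum(w * x for w, x in zip(row, q)) for row in weight] for q in queries]


def fuse(general, specific, text_feats, image_feats, w_general, w_specific):
    # 1) Joint self-attention: general and specific queries attend to each other.
    joint = general + specific
    joint = attention(joint, joint, joint)
    g, s = joint[:len(general)], joint[len(general):]
    # 2) The same (shared) cross-attention, applied to each group independently:
    #    query-to-text first, then query-to-image.
    g = attention(attention(g, text_feats, text_feats), image_feats, image_feats)
    s = attention(attention(s, text_feats, text_feats), image_feats, image_feats)
    # 3) Unshared box heads: separate weights for each query group.
    return box_head(g, w_general), box_head(s, w_specific)


def rand_vectors(n, dim):
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
DIM = 8
general_boxes, specific_boxes = fuse(
    general=rand_vectors(3, DIM),      # e.g. queries derived from VLM proposals
    specific=rand_vectors(2, DIM),     # e.g. queries from the open-set detector
    text_feats=rand_vectors(4, DIM),
    image_feats=rand_vectors(5, DIM),
    w_general=rand_vectors(4, DIM),    # 4 output rows: one per box coordinate
    w_specific=rand_vectors(4, DIM),
)
print(len(general_boxes), len(specific_boxes))  # 3 2
```

In this sketch, only `fuse` step 1 and the two `box_head` weight matrices stand in for the parts that the caption says are updated during training. The shared cross-attention in step 2 stands in for the frozen layers.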