
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts

Zhiwei Lin, Yongtao Wang, Zhi Tang

TL;DR

VL-SAM tackles open-ended object detection and segmentation without training by cascading a Vision-Language Model (VLM) with the Segment-Anything Model (SAM), using attention maps as prompts. The method builds high-quality attention maps through head aggregation and regularized attention flow, then iteratively samples positive and negative points to guide SAM, augmented by multi-scale and question-prompt ensembles. On LVIS and CODA, VL-SAM outperforms prior open-ended methods and comes close to the upper-bound SAM performance, indicating practical potential for real-world open-world perception. The framework also generalizes across models: it is compatible with multiple VLMs and SAM variants.

Abstract

Existing perception models achieve great success by learning from large amounts of labeled data, but they still struggle with open-world scenarios. To alleviate this issue, researchers have introduced open-set perception tasks, which detect or segment objects unseen in the training set. However, these models require predefined object categories as inputs during inference, which are not available in real-world scenarios. Recently, researchers have posed a new and more practical problem, i.e., open-ended object detection, which discovers unseen objects without any object categories as inputs. In this paper, we present VL-SAM, a training-free framework that combines a generalized object recognition model (i.e., a Vision-Language Model, VLM) with a generalized object localization model (i.e., the Segment-Anything Model, SAM) to address the open-ended object detection and segmentation task. Without additional training, we connect these two generalized models using attention maps as the prompts. Specifically, we design an attention map generation module that employs head aggregation and a regularized attention flow to aggregate and propagate attention maps across all heads and layers in the VLM, yielding high-quality attention maps. Then, a prompt generation module iteratively samples positive and negative points from the attention maps and sends the sampled points to SAM to segment the corresponding objects. Experimental results on the long-tail instance segmentation dataset (LVIS) show that our method surpasses the previous open-ended method on the object detection task and can additionally provide instance segmentation masks. Besides, VL-SAM achieves favorable performance on the corner-case object detection dataset (CODA), demonstrating its effectiveness in real-world applications. Moreover, VL-SAM generalizes well across models and can incorporate various VLMs and SAMs.
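The prompt generation step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the threshold values, the `max_points` limit, the rule of masking the attention map between rounds, and the `segment_fn` callback (a hypothetical stand-in for a SAM call) are all assumptions, and plain Python lists are used to keep the example dependency-free.

```python
def sample_points(attn, pos_thresh=0.7, neg_thresh=0.1, max_points=3):
    """Pick positive/negative point prompts from a 2D attention map.

    attn is a list of rows of floats. Thresholds are fractions of the
    map's maximum value (hypothetical defaults, not from the paper).
    Returns (positives, negatives) as lists of (row, col) tuples.
    """
    cells = [(v, r, c) for r, row in enumerate(attn) for c, v in enumerate(row)]
    peak = max(v for v, _, _ in cells)
    if peak <= 0:
        return [], []
    # Stable sorts: ties keep row-major order.
    strongest = sorted(cells, key=lambda t: -t[0])
    weakest = sorted(cells, key=lambda t: t[0])
    positives = [(r, c) for v, r, c in strongest if v >= pos_thresh * peak]
    negatives = [(r, c) for v, r, c in weakest if v <= neg_thresh * peak]
    return positives[:max_points], negatives[:max_points]


def iterative_prompts(attn, segment_fn, rounds=2):
    """Alternate between sampling points and segmenting.

    segment_fn is a hypothetical stand-in for a SAM call: it takes
    (positives, negatives) and returns a binary mask shaped like attn.
    """
    current = [row[:] for row in attn]
    mask = None
    for _ in range(rounds):
        positives, negatives = sample_points(current)
        if not positives:
            break
        mask = segment_fn(positives, negatives)
        # Assumption: keep attention only inside the predicted mask
        # before sampling the next round of points.
        current = [[v if m else 0.0 for v, m in zip(row, mask_row)]
                   for row, mask_row in zip(attn, mask)]
    return mask
```

With a real segmenter, `segment_fn` would wrap the model's point-prompt interface; here any function that returns a binary mask of the same shape as the attention map will do.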

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Illustration of VL-SAM. Without additional training, we connect the vision-language and segment-anything models with attention maps as the intermediate prompts.
  • Figure 2: An overview of the VL-SAM framework. We first use the VLM to describe the input image and generate the names of all possible objects. Then, for each object name, we obtain the corresponding attention map with the attention map generation module. Finally, we sample point prompts from the attention map and send them to SAM to predict detection and segmentation results.
  • Figure 3: Head aggregation. We aggregate information from all attention heads with head weights.
  • Figure 4: Attention flow. We propagate attention from the first layer to the last layer with attention flow.
  • Figure 5: Illustration of attention collapse. For each column, from left to right, we show image inputs, attention flow (collapse), regularized attention flow, and generated answers from VLM.
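
The ideas behind Figures 3 and 4 can be illustrated with a small sketch. It is not the paper's formulation: the head weights are taken as given inputs (how they are derived is not stated here), the propagation uses a generic rollout-style product of `(I + A)` matrices as a stand-in for attention flow, and the regularization that counters attention collapse (Figure 5) is omitted. All function names are hypothetical.

```python
def aggregate_heads(head_maps, head_weights):
    """Weighted average of per-head attention matrices for one layer.

    head_maps: list of H square matrices (lists of lists of floats).
    head_weights: H non-negative floats (assumed to be given).
    """
    total = sum(head_weights)
    n = len(head_maps[0])
    return [[sum(w * h[i][j] for w, h in zip(head_weights, head_maps)) / total
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention_flow(layer_maps):
    """Rollout-style propagation from the first layer to the last.

    Each layer's map is mixed with the identity (modelling the residual
    path), row-normalised, and multiplied into the running product.
    """
    n = len(layer_maps[0])
    flow = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_maps:
        mixed = [[attn[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        mixed = [[v / sum(row) for v in row] for row in mixed]
        flow = matmul(mixed, flow)
    return flow
```

In this sketch each row of the propagated map still sums to 1, because every per-layer matrix is row-normalised before it is multiplied in.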