Boosting Segment Anything Model Towards Open-Vocabulary Learning

Xumeng Han, Longhui Wei, Xuehui Yu, Zhiyang Dou, Xin He, Kuiran Wang, Yingfei Sun, Zhenjun Han, Qi Tian

TL;DR

This work presents Sambor, an end-to-end open-vocabulary object detector built atop the Segment Anything Model (SAM). It introduces SideFormer to extract SAM-aware features and inject semantic information from CLIP, an Open-set RPN to augment SAM-based proposals, and a CLIP-based open-vocabulary classifier, enabling SAM to recognize arbitrary categories while retaining zero-shot localization. Across COCO and LVIS, Sambor achieves state-of-the-art zero-shot performance and demonstrates meaningful gains from the proposed modules, particularly when the Open-set RPN is finetuned. The framework emphasizes end-to-end integration and interactive prompting, illustrating a practical path toward open-vocabulary learning with vision foundation models.

Abstract

The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we boost it to detect arbitrary objects from human inputs like category names or reference expressions. Building upon the SAM image encoder, we introduce a novel SideFormer module designed to acquire SAM features adept at perceiving objects and inject comprehensive semantic information for recognition. In addition, we devise an Open-set RPN that leverages SAM proposals to assist in finding potential objects. Consequently, Sambor enables the open-vocabulary detector to equally focus on generalizing both localization and classification sub-tasks. Our approach demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous state-of-the-art methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.
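
To make the open-vocabulary classification idea concrete, here is a minimal sketch of scoring one region against free-form category names by cosine similarity between embeddings. It is an illustration only, not Sambor's implementation. The function names, the toy 3-dimensional vectors, and the temperature value are assumptions; in a real system the region and text embeddings would come from trained vision and language encoders such as CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities.

    text_embs maps a category name to its (hypothetical) text embedding.
    The category set can change at inference time without retraining,
    which is what makes the classifier open-vocabulary.
    """
    names = list(text_embs)
    logits = [cosine(region_emb, text_embs[n]) / temperature for n in names]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}

# Toy embeddings (assumed values, not produced by any real encoder).
text_embs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "umbrella": [0.0, 0.1, 0.9],
}
region_emb = [0.8, 0.2, 0.1]

probs = classify_region(region_emb, text_embs)
best = max(probs, key=probs.get)
print(best)
```

Adding a new category only requires adding another entry to `text_embs`, so the vocabulary is set by the text prompts rather than by a fixed classification head.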

Paper Structure (16 sections, 3 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: We develop an end-to-end open-vocabulary object detector called Sambor, building upon the vision foundation model SAM. Sambor enables SAM to recognize arbitrary object categories, bridging semantic gaps. It also leverages SAM's generalization and interactive capabilities to enhance zero-shot performance and extend versatility.
  • Figure 2: Overall architecture of Sambor. (Left) We adopt the SAM image encoder as the backbone and construct a SideFormer module to extract features and inject CLIP visual information for enhancing semantic understanding. Sambor is built upon a two-stage detector, with the first stage designed as an Open-set RPN that enhances the vanilla RPN using open-set proposals generated by SAM. The second stage is equipped with a CLIP language branch for parallel concept encoding, thereby endowing the detector with open-vocabulary classification. (Right) The specific implementations of the extractor and injector.
  • Figure 3: An illustration of Open-set RPN. We demonstrate two examples where SAM proposals effectively complement the vanilla RPN: (Top-Left) precise determination of object edge positions, and (Bottom-Right) clear capture of specific parts of an object, e.g., a person's clothing.
  • Figure 4: Visual comparison between Open-set RPN and the vanilla RPN. For clarity, we only display high-quality proposals whose IoU with the ground-truth boxes exceeds 0.7. In the first two examples, the vanilla RPN fails to generate any proposal meeting this criterion; thus, we show the one with the highest IoU.
  • Figure 5: Visualization of Sambor for open-vocabulary object detection and instance segmentation. For better mask visual quality, we adopt HQ-SAM as the mask decoder.
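
The display rule described in the Figure 4 caption (keep proposals whose IoU with the ground truth exceeds 0.7, otherwise fall back to the single best proposal) can be sketched as follows. This is a hedged illustration, not code from the paper. The function names, the `(x1, y1, x2, y2)` box format, and the toy boxes are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def proposals_to_display(proposals, gt_box, thresh=0.7):
    """Return proposals with IoU > thresh against gt_box.

    If none qualifies, return only the proposal with the highest IoU,
    mirroring the fallback described in the Figure 4 caption.
    """
    scored = [(iou(p, gt_box), p) for p in proposals]
    good = [p for score, p in scored if score > thresh]
    if good:
        return good
    if not scored:
        return []
    return [max(scored, key=lambda sp: sp[0])[1]]

# Toy data (assumed): one ground-truth box and two proposal sets.
gt = (10, 10, 110, 110)
vanilla = [(0, 0, 60, 60), (50, 50, 150, 150)]  # both overlap poorly
augmented = vanilla + [(12, 8, 108, 112)]       # one extra, tighter box

print(proposals_to_display(vanilla, gt))    # fallback: single best box
print(proposals_to_display(augmented, gt))  # only the box above 0.7
```

In this toy case neither box in the first set reaches the threshold, so only the closest one is returned, while the second set yields just the tight box.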