PosSAM: Panoptic Open-vocabulary Segment Anything

Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli

TL;DR

PosSAM tackles open-vocabulary panoptic segmentation by unifying the Segment Anything Model (SAM) with CLIP in an end-to-end framework. It introduces Local Discriminative Pooling (LDP) to fuse SAM's spatially rich features with CLIP's semantic embeddings, and Mask-Aware Selective Ensembling (MASE) to distinguish seen from unseen classes during inference. The method uses a frozen SAM encoder, a Mask2Former-style mask decoder, and a learning objective combining mask quality, IoU, and classification losses. Experiments on COCO↔ADE20K demonstrate state-of-the-art performance and strong cross-dataset generalization, with significant PQ gains over prior open-vocabulary panoptic methods.

Abstract

In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model that leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling (LDP) module leveraging class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary classification. Furthermore, we introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image. We conducted extensive experiments to demonstrate our method's strong generalization properties across multiple datasets, achieving state-of-the-art performance with substantial improvements over SOTA open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art methods by a large margin, 2.4 PQ and 4.6 PQ, respectively. Project Website: https://vibashan.github.io/possam-web/.
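
The open-vocabulary classification step described in the abstract can be illustrated with a minimal sketch: pool image features inside a predicted mask, then compare the pooled embedding with text embeddings of the class names. This is not the paper's LDP module, which learns the pooling; it is plain mask-average pooling followed by cosine similarity. The function names (`mask_pool`, `cosine`, `classify`), the temperature value, and the toy features are all hypothetical.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors at pixels where the mask is 1.

    features: H x W grid (nested lists) of D-dimensional vectors.
    mask:     H x W grid of 0/1 values for one mask proposal.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(embedding, text_embeddings, temperature=0.01):
    """Softmax over cosine similarities to each class-name embedding.

    text_embeddings: dict mapping class name -> D-dimensional vector.
    temperature is an assumed value, not taken from the paper.
    """
    logits = [cosine(embedding, t) / temperature for t in text_embeddings.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    norm = sum(exps)
    return {name: e / norm for name, e in zip(text_embeddings, exps)}


# Toy 2 x 2 feature map with 2-dimensional features (made-up numbers).
features = [[[1.0, 0.0], [1.0, 0.2]],
            [[0.0, 1.0], [0.0, 1.0]]]
mask = [[1, 1],
        [0, 0]]  # the proposal covers the top row only
text_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

pooled = mask_pool(features, mask)
probs = classify(pooled, text_embeddings)
```

Because the class list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is the property that makes this style of classifier open-vocabulary.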

Paper Structure

This paper contains 19 sections, 4 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Left: While SAM possesses exceptional spatial awareness and promptable segmentation capabilities, it lacks class/semantic awareness and tends to over-segment objects into multiple regions. Our proposed PosSAM enhances SAM with instance and class awareness by efficiently integrating SAM's representations with semantically discriminative CLIP embeddings, resulting in robust open-vocabulary panoptic segmentation. Right: As shown, we achieve state-of-the-art performance in both COCO to ADE20K and ADE20K to COCO settings.
  • Figure 2: Visualization of K-means clustering of frozen CLIP (Radford et al., 2021) and SAM (Kirillov et al., 2023) backbone features. As illustrated, SAM's cluster maps localize objects more precisely than CLIP's. While SAM lacks instance awareness, it still produces more defined boundaries between parts of objects, indicating its enhanced spatial awareness and fine-grained representation learning capabilities.
  • Figure 3: Overview of our PosSAM training pipeline. We first encode the input image using the SAM backbone to extract spatially rich features, which are processed through a Feature Pyramid Network to obtain hierarchical multi-scale features decoded to form mask features and predict class-agnostic masks. Concurrently, we train an IoU predictor for each mask to measure its quality. We obtain CLIP image features and develop our proposed LDP module to achieve better classification of these masks. We learn to pool and generate enhanced class-specific features, which are then classified by a process supervised with ground truth category labels derived from the CLIP text encoder.
  • Figure 4: In the inference pipeline, the LDP embeddings and CLIP embeddings are generated from the local discriminative pooling module and mask pool module, respectively. These embeddings are used to classify the mask proposal by performing a product with pre-computed CLIP text embeddings. Final predictions are processed by our MASE strategy, where IoU score is utilized to weigh the classification predictions and an adaptive geometric ensemble is applied to the outputs of the LDP and CLIP embeddings.
  • Figure 5: Local Discriminative Pooling Module. The LDP module learns to effectively fuse the class-agnostic SAM features with discriminative CLIP features, thereby avoiding overfitting to seen classes during training, which is crucial for OV inference.
  • ...and 4 more figures
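
The inference step in the Figure 4 caption, a geometric ensemble of the LDP and CLIP predictions weighted by the predicted IoU, can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's MASE algorithm: the captions do not say how the ensemble weight is adapted per image, so `alpha` is a fixed, hypothetical input here, and the function names are likewise invented for the example.

```python
def geometric_ensemble(p_ldp, p_clip, alpha):
    """Fuse two per-class probability lists as p_ldp^(1-alpha) * p_clip^alpha.

    alpha = 0 keeps only the LDP prediction, alpha = 1 only the CLIP one.
    The result is renormalised so that it sums to 1.
    """
    fused = [(a ** (1.0 - alpha)) * (b ** alpha) for a, b in zip(p_ldp, p_clip)]
    norm = sum(fused)
    return [f / norm for f in fused] if norm > 0 else fused


def score_mask(p_ldp, p_clip, iou, alpha=0.5):
    """Return (best class index, confidence) for one mask proposal.

    The confidence is the fused class probability scaled by the predicted
    IoU, so low-quality masks are down-weighted. alpha=0.5 is an assumed
    default, not a value from the paper.
    """
    probs = geometric_ensemble(p_ldp, p_clip, alpha)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best] * iou


# Made-up predictions for one mask over two classes.
best_class, confidence = score_mask([0.9, 0.1], [0.6, 0.4], iou=0.5)
```

A geometric mean, unlike an arithmetic one, lets either branch veto a class by assigning it a probability near zero, which is one plausible reason to prefer it when combining a branch trained on seen classes with a frozen CLIP branch.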