Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

Yongkang Li, Tianheng Cheng, Bin Feng, Wenyu Liu, Xinggang Wang

TL;DR

Mask-Adapter tackles a core bottleneck in open-vocabulary segmentation: the misalignment between CLIP embeddings and mask-based predictions. By extracting semantic activation maps from proposal masks and CLIP features, it produces richer, context-aware mask embeddings that improve mask-text alignment and discrimination. The approach introduces a robust IoU-based matcher, a mask-consistency loss, and a two-stage training regime (GT warmup followed by mixed-mask training), culminating in strong gains across ADE20K, Pascal-Context, and SAM extensions. Practically, Mask-Adapter is plug-and-play for existing mask-pooling methods and significantly enhances zero-shot segmentation performance while preserving CLIP's open-vocabulary capabilities, with potential broad impact on dense OV tasks and beyond. The training loss is a weighted sum of two terms, $L = \lambda_{ce}L_{ce} + \lambda_{cos}L_{cos}$.

Abstract

Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, e.g., CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce Mask-Adapter, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts semantic activation maps from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a mask consistency loss that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models' robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. Extensive experiments across several zero-shot benchmarks demonstrate significant performance gains for the proposed Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models are available at https://github.com/hustvl/MaskAdapter.
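
For readers unfamiliar with mask pooling, the following is a minimal sketch in plain Python, not the authors' implementation. It averages the per-pixel feature vectors inside a binary mask and classifies the pooled embedding by cosine similarity against text embeddings. The toy 2 x 2 feature map, the two-dimensional embeddings, the class names, and the function names `mask_pool`, `cosine`, and `classify` are all illustrative assumptions; a real pipeline would use CLIP's dense image features and text encoder.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors at pixels where the binary mask is 1.

    features: H x W grid of C-dimensional vectors (nested lists).
    mask:     H x W grid of 0/1 values.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def classify(mask_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the mask embedding."""
    return max(
        text_embeddings,
        key=lambda name: cosine(mask_embedding, text_embeddings[name]),
    )


# Toy data (hypothetical): a 2 x 2 feature map with 2 channels.
features = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
]
top_row_mask = [[1, 1], [0, 0]]
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

embedding = mask_pool(features, top_row_mask)
label = classify(embedding, text_embeddings)
```

Because every pixel inside the mask contributes equally, the pooled embedding cannot emphasise the more discriminative parts of a region, which is the limitation the abstract attributes to mask pooling.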

Paper Structure

This paper contains 42 sections, 8 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Upper bound analysis of mask embedding extraction methods. Using ADE20K ground-truth masks as input, we evaluate the upper bound of different extraction methods. Although mask cropping and mask pooling show limited performance with ground-truth masks, our Mask-Adapter significantly enhances the upper bound for open-vocabulary segmentation.
  • Figure 2: Comparison of mask embedding extraction methods. (a) Mask Cropping: cropping the segmented region from the image and feeding it into CLIP to extract mask embeddings. (b) Mask Pooling: aggregating region features with the proposal masks. (c) Mask-Adapter: the proposal masks and CLIP features are passed through Mask-Adapter to extract semantic activation maps, which are then used to construct mask embeddings by aggregating CLIP features based on these highlighted regions and contextual information.
  • Figure 3: Overview of Mask-Adapter. (a) Mask-Adapter for Open-Vocabulary Segmentation. Mask-Adapter can be seamlessly integrated into open-vocabulary segmentation methods based on mask pooling. Mask-Adapter extracts semantic activation maps from CLIP features and proposal masks. Mask embeddings are aggregated according to semantic activation maps, which provide richer contextual and semantic information. The aggregated mask embeddings are then matched with text embeddings to perform mask classification. During training, only the Mask-Adapter is trainable. (b) Details of the Mask-Adapter. After patchifying the masks, masks and CLIP features are fused and processed through ConvNeXt blocks, ultimately outputting the semantic activation maps through a predictor.
  • Figure 4: Visualizations of Semantic Activation Maps. We present visualizations of the semantic activation maps and their corresponding segmentation masks. For each input image, the top row shows the semantic activation maps, while the bottom row displays the segmentation masks. The semantic activation maps emphasize the most discriminative regions of the mask. Best viewed on screen after zooming in.
  • Figure 5: t-SNE visualization of mask embeddings from different extraction methods. Mask embeddings extracted using the Mask-Adapter demonstrate better separability compared to those obtained by mask pooling.
  • ...and 6 more figures
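
The contrast between Figure 2(b) and Figure 2(c) can be illustrated with a small sketch. Mask pooling weights every pixel inside the mask equally, whereas aggregation with a semantic activation map uses soft per-pixel weights. In the sketch below the activation map is written by hand purely for illustration; in Mask-Adapter it is predicted from the proposal mask and CLIP features. The function name `aggregate` and all numeric values are hypothetical, and the code is not the authors' implementation.

```python
def aggregate(features, weights):
    """Weighted average of per-pixel feature vectors.

    features: H x W grid of C-dimensional vectors (nested lists).
    weights:  H x W grid of non-negative weights. A 0/1 mask gives
              ordinary mask pooling; a soft map gives weighted aggregation.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    weight_sum = 0.0
    for feat_row, w_row in zip(features, weights):
        for vec, w in zip(feat_row, w_row):
            weight_sum += w
            for c in range(channels):
                total[c] += w * vec[c]
    if weight_sum == 0:
        return total  # all-zero weights: return the zero vector
    return [t / weight_sum for t in total]


# Toy data (hypothetical): a 2 x 2 feature map with 2 channels.
features = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.0, 1.0]],
]
binary_mask = [[1, 1], [0, 0]]
# Hand-written soft map that concentrates on the first pixel.
activation_map = [[0.9, 0.1], [0.0, 0.0]]

pooled = aggregate(features, binary_mask)        # equal weight on both masked pixels
weighted = aggregate(features, activation_map)   # emphasises the first pixel
```

With these toy values, the weighted embedding lies closer to the feature of the first pixel than the uniformly pooled one does, which mirrors the caption's point that activation maps highlight the most discriminative regions.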