High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation

Quan-Sheng Zeng, Yunheng Li, Daquan Zhou, Guanbin Li, Qibin Hou, Ming-Ming Cheng

TL;DR

The paper addresses open-vocabulary segmentation by showing that high-quality region masks are essential for robust CLIP-based regional representations. It introduces MaskCLIP++, a fine-tuning framework that uses ground-truth masks to train CLIP and a consistency alignment principle to prevent overfitting, decoupling training from mask generators. The approach yields significant gains in mask classification and open-vocabulary segmentation across multiple datasets and remains compatible with existing mask generators, improving efficiency. Overall, MaskCLIP++ offers a practical, generator-agnostic path to stronger open-vocabulary segmentation with reduced training costs and better generalization.

Abstract

Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Due to the limited diversity of image segmentation datasets with mask annotations, we propose incorporating a consistency alignment principle during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, MaskCLIP++ significantly improves the mask classification performance on multi-domain datasets. Combined with the mask generator in previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively. Code is available at https://github.com/HVision-NKU/MaskCLIPpp.
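To make "mask classification" concrete, the sketch below pools dense image features inside each region mask and scores the pooled embedding against text embeddings by cosine similarity. It is only an illustration under stated assumptions: the function names `mask_pool` and `classify_masks`, the toy arrays, and the temperature value are hypothetical, and plain mask-average pooling stands in for whatever pooling the paper actually uses. It does not implement the paper's consistency alignment.

```python
import numpy as np


def mask_pool(features: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Average per-pixel features inside each binary mask.

    features: (H, W, D) dense feature map (a stand-in for CLIP image features).
    masks:    (N, H, W) binary region masks (ground-truth masks in fine-tuning).
    returns:  (N, D) array with one embedding per region.
    """
    masks = masks.astype(features.dtype)
    summed = np.einsum("nhw,hwd->nd", masks, features)
    area = masks.sum(axis=(1, 2))[:, None]
    # Guard against empty masks to avoid division by zero.
    return summed / np.maximum(area, 1e-6)


def classify_masks(region_emb: np.ndarray, text_emb: np.ndarray,
                   temperature: float = 0.01) -> np.ndarray:
    """Softmax over cosine similarities between regions and class texts.

    region_emb: (N, D) region embeddings; text_emb: (C, D) text embeddings.
    returns:    (N, C) class probabilities per region.
    The temperature is an assumed value, not one taken from the paper.
    """
    region = region_emb / np.maximum(
        np.linalg.norm(region_emb, axis=1, keepdims=True), 1e-12)
    text = text_emb / np.maximum(
        np.linalg.norm(text_emb, axis=1, keepdims=True), 1e-12)
    logits = region @ text.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy data: a 2x2 image whose left column looks like class 0
# and whose right column looks like class 1.
features = np.zeros((2, 2, 2))
features[:, 0, 0] = 1.0
features[:, 1, 1] = 1.0

masks = np.zeros((2, 2, 2))
masks[0, :, 0] = 1.0  # region 0 covers the left column
masks[1, :, 1] = 1.0  # region 1 covers the right column

text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])  # hypothetical class embeddings

probs = classify_masks(mask_pool(features, masks), text_emb)
predicted = probs.argmax(axis=1)
```

Because the masks only enter through `mask_pool`, the same classifier can score masks from any source, which is the sense in which the abstract describes swapping in the mask generators of earlier methods at inference time.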

Paper Structure

This paper contains 20 sections, 3 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Observations: (a) demonstrates the potential negative impact of low-quality generated masks on CLIP's mask classification learning, while (b) showcases the untapped generalization potential of existing mask generators. Mask generators are trained on COCO (Lin et al., 2014), with results reported on ADE20K (Zhou et al., 2017).
  • Figure 2: Comparison of the training pipelines of previous mask-based OVS methods and ours. $\mathcal{L}$ denotes the loss function. (a) adapts the mask generator to CLIP and avoids overfitting by freezing CLIP. (b) adapts CLIP to the mask generator and avoids overfitting by distillation. "T" and "S" denote the teacher and student models, respectively. Our method (c) abandons the mask generator and avoids overfitting by consistency alignment.
  • Figure 3: Detailed framework of MaskCLIP++ for OVS tasks. The PSM represents the parameterized similarity modeling, which is designed under the principle of consistency alignment. The mask generator is only used during inference, and can be flexibly replaced.
  • Figure 4: Visualization of the weight function. The regions of interest are enclosed by yellow contours, and the deeper the red color at a position, the higher its importance to the region.
  • Figure 5: Visualizations of open-vocabulary semantic segmentation on ADE20K (Zhou et al., 2017). Our method produces more complete masks than CAT-Seg and fewer biases in mask classification than MAFT+.
  • ...and 2 more figures