Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

Zhaoqing Wang, Xiaobo Xia, Ziye Chen, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu

TL;DR

The paper introduces Unpair-Seg, a weakly-supervised framework for open-vocabulary segmentation that learns from unpaired image-mask and image-text data. It generates masks from image-mask pairs, uses a vision-language model to re-caption images and extract entities, and applies bipartite matching in CLIP space along with a multi-scale feature adapter to align region and text embeddings. This approach enables open-vocabulary semantic and panoptic segmentation with competitive results, significantly narrowing the gap to fully-supervised methods while reducing annotation requirements. The method demonstrates robustness through ablations and shows strong performance across diverse segmentation tasks and datasets.

Abstract

Current state-of-the-art open-vocabulary segmentation methods typically rely on image-mask-text triplet annotations for supervision. However, acquiring such detailed annotations is labour-intensive and poses scalability challenges in complex real-world scenarios. While existing weakly-supervised approaches leverage image-text pairs to reduce the expensive annotation cost, the lack of mask supervision makes it difficult for the model to locate multiple instances and accurately group pixels with similar semantics, significantly hampering versatility and performance. In this paper, we introduce Unpair-Seg, a novel weakly-supervised open-vocabulary segmentation framework that learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected. Unpair-Seg initially predicts a set of binary masks and generates pseudo labels by identifying confident pairs of masks and text entities. We then train a feature adapter to align region embeddings with text embeddings based on these pseudo labels, achieving open-vocabulary segmentation. However, the inherent noise in the mask-entity correspondence poses a challenge to obtaining reliable pairs. To address this, we employ a large vision-language model to re-caption the input images and extract precise entities, and we design a multi-scale matching strategy to reduce noisy mask-entity pairs. Our Unpair-Seg framework demonstrates impressive performance, achieving 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets, respectively, significantly narrowing the gap between fully-supervised and weakly-supervised methods.
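
To make the pseudo-labelling step concrete, the following is a minimal sketch of one-to-one matching between text entities and predicted masks by cosine similarity, keeping only confident pairs. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy embeddings, and the `min_sim` threshold are hypothetical, and an exhaustive search over assignments stands in for a proper bipartite matching solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`), which would be needed for realistic numbers of masks.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_entities_to_masks(entity_embs, region_embs, min_sim=0.5):
    """Assign each entity to a distinct mask, maximising total similarity.

    Returns {entity_index: mask_index} restricted to confident pairs.
    Assumptions (hypothetical, for illustration only):
      - len(entity_embs) <= len(region_embs)
      - inputs are small enough for exhaustive search
      - min_sim is an arbitrary confidence threshold
    """
    sim = [[cosine(e, r) for r in region_embs] for e in entity_embs]
    n_ent, n_mask = len(entity_embs), len(region_embs)

    best_total, best_assign = float("-inf"), ()
    # Try every one-to-one assignment of entities to masks.
    for assign in permutations(range(n_mask), n_ent):
        total = sum(sim[i][j] for i, j in enumerate(assign))
        if total > best_total:
            best_total, best_assign = total, assign

    # Discard low-similarity pairs so they do not become noisy pseudo labels.
    return {i: j for i, j in enumerate(best_assign) if sim[i][j] >= min_sim}


# Toy embeddings: two entities, three candidate masks.
entities = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
regions = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
pairs = match_entities_to_masks(entities, regions)
print(pairs)  # {0: 1, 1: 0}
```

In this toy case the third mask is left unmatched, which mirrors the situation where a predicted mask has no corresponding entity in the caption.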

Paper Structure (16 sections, 8 equations, 18 figures, 10 tables)

This paper contains 16 sections, 8 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: The Unpair-Seg framework learns directly from unpaired mask-text supervision. Unlike labour-intensive image-mask-text annotations, independent image-mask and image-text pairs are easier to collect. With a single set of weights, Unpair-Seg handles various image segmentation tasks, including point-prompt, box-prompt, open-vocabulary semantic, and open-vocabulary panoptic segmentation. Extensive experimental results demonstrate that our method significantly narrows the gap between fully-supervised and weakly-supervised approaches.
  • Figure 2: Overview of the proposed Unpair-Seg framework. Our framework consists of two stages: mask generation and mask-entity alignment. Given image-mask pairs, we first train a prompt encoder, pixel decoder, and mask decoder for binary mask generation. Subsequently, given image-text pairs, a feature adapter is optimized to align the regional embeddings of predicted masks with the entity embeddings of text descriptions. A mask-entity bipartite matching assigns the corresponding mask prediction to each entity. The CLIP visual and text encoders are frozen. Visual prompts using boxes are omitted for simplicity.
  • Figure 3: Comparison between raw and improved text descriptions. "Misalign.", "Deficient.", and "Missing." denote text-image misalignment, deficient description, and missing text description.
  • Figure 4: Visualisation. We show prediction results on three tasks: promptable, open-vocabulary semantic, and open-vocabulary panoptic segmentation. The results are best viewed in color.
  • Figure 5: Architecture of the mask decoder layer. This decoder layer updates both the visual prompt embeddings and the pixel features through cross-attention layers. The self-attention layer is used to update the visual prompts. At each attention layer, positional encodings are added to the pixel features, and the original visual prompts (including their positional encodings) are added to the updated visual prompts.
  • ...and 13 more figures
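
The Figure 2 caption refers to "regional embeddings of predicted masks" without saying how they are computed. One common way to obtain such an embedding is mask pooling, that is, averaging the per-pixel features that fall inside a binary mask. The sketch below assumes that approach purely for illustration; the function name `mask_pool` and the toy feature map are hypothetical, and the paper may use a different pooling scheme.

```python
def mask_pool(features, mask):
    """Average the feature vectors of the pixels selected by a binary mask.

    features: H x W x C nested lists of floats (per-pixel features).
    mask:     H x W nested lists of 0/1 (predicted binary mask).
    Returns a length-C list; all zeros if the mask selects no pixel.
    Assumption: plain mean pooling, chosen here only as an illustration.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total
    return [t / count for t in total]


# Toy 2 x 2 feature map with 2 channels.
features = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
top_row = mask_pool(features, [[1, 1], [0, 0]])
bottom_row = mask_pool(features, [[0, 0], [1, 1]])
print(top_row, bottom_row)  # [2.0, 0.0] [0.0, 3.0]
```

Each pooled vector can then be compared against entity text embeddings, which is the role region embeddings play in the alignment stage described in the caption.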