Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang, Chun-Pei Chen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Yung-Yu Chuang, Yen-Yu Lin

TL;DR

This paper addresses the challenge of text-supervised semantic segmentation by bridging the gap between text semantics and segmentation units. It introduces Image-Text Co-Decomposition (CoDe), which jointly decomposes images into regions and text into word segments, and uses region-word contrastive learning to align them. A prompt-learning mechanism for region and word highlighting mitigates domain shift when using masked inputs, improving feature extraction in vision-language models. Across six benchmark datasets, CoDe achieves state-of-the-art zero-shot segmentation performance and demonstrates the value of jointly decomposing text for more precise region-level supervision.

Abstract

This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
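The region-word alignment described above can be illustrated with a generic symmetric InfoNCE objective. This is a minimal sketch and not the authors' implementation: the abstract does not state the exact loss, so the function name `region_word_contrastive_loss`, the temperature of 0.07, and the assumption that row i of each matrix forms a matched region-word pair are all illustrative choices.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def region_word_contrastive_loss(region_emb, word_emb, temperature=0.07):
    """Symmetric InfoNCE over N (region, word) pairs.

    Assumption: row i of region_emb and row i of word_emb come from the
    same noun, so the diagonal of the similarity matrix holds positives.
    """
    r = l2_normalize(region_emb)
    w = l2_normalize(word_emb)
    logits = r @ w.T / temperature          # (N, N) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_r2w = -log_softmax(logits, axis=1)[idx, idx].mean()  # region -> word
    loss_w2r = -log_softmax(logits, axis=0)[idx, idx].mean()  # word -> region
    return 0.5 * (loss_r2w + loss_w2r)

# Toy data: regions that match their words versus regions paired wrongly.
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(4, 16))
aligned = word_emb + 0.01 * rng.normal(size=(4, 16))
shuffled = aligned[[1, 2, 3, 0]]

loss_aligned = region_word_contrastive_loss(aligned, word_emb)
loss_shuffled = region_word_contrastive_loss(shuffled, word_emb)
```

In this toy setup the correctly paired embeddings give a lower loss than the shuffled ones, which is the behavior a contrastive objective is meant to reward.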

Paper Structure (40 sections, 8 equations, 7 figures, 3 tables)

This paper contains 40 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Existing methods perform text-supervised semantic segmentation by learning either (a) image-text alignment or (b) region-text alignment. This paper presents (c) region-word alignment via image-text co-decomposition, where the image and the text are decomposed into object regions and word segments, respectively, while contrastive learning is used to establish cross-modal correspondences between these object regions and word segments.
  • Figure 2: Training pipeline of our method for image-text co-decomposition. Our method consists of three major modules, including (a) the image-text co-segmentation module where the image and text segmenters estimate the region and word masks according to a selected noun, respectively, (b) the region-word highlighting module where the estimated masks together with two learnable prompts produce the highlighted image and text, and (c) the region-word alignment module where contrastive learning is applied to the embedded object regions and word segments to accomplish region-word alignment.
  • Figure 3: Qualitative comparisons. The proposed method is compared with the two most competitive methods, TCL [cha2022tcl] and SimSeg [yi2023simple], on the PASCAL VOC, PASCAL Context, and COCO Object datasets. Our method provides more precise object boundaries and effectively localizes objects within images without misclassification, leading to more accurate segmentation.
  • Figure 4: Visualization of the results of our image-text co-decomposition method. The first two rows display the texts and images that form the input image-text pairs. In each text, nouns are underlined in different colors. Our method uses these nouns as queries for performing image-text co-decomposition. The last two rows show the method's output, where the regions and word segments associated with each noun appear in the corresponding color.
  • Figure 5: Ablation studies. We improve the baseline model by incrementally including (C.) the image-text co-decomposition module, (W.) the word highlighting prompt, and (R.) the region highlighting prompt. We present the segmentation results of the resulting models on images from the PASCAL VOC [everingham2010pascal] dataset.
  • ...and 2 more figures
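
The highlighting step in Figure 2(b) combines an estimated mask with a learnable prompt to produce a highlighted image or text. One plausible reading, sketched below, is a mask-weighted blend: content inside the mask is kept and content outside it is replaced by the prompt. This blend rule, the helper name `highlight`, and all shapes are assumptions made for illustration; the caption does not specify the actual operation.

```python
import numpy as np

def highlight(features, mask, prompt):
    """Blend features with a prompt using a soft mask in [0, 1].

    Assumed rule (hypothetical, not taken from the paper):
        highlighted = mask * features + (1 - mask) * prompt
    `mask` has one fewer trailing dimension than `features`.
    """
    mask = mask[..., None]  # broadcast over the channel/embedding axis
    return mask * features + (1.0 - mask) * prompt

# Image case: a 4x4 RGB image, a region mask, and an image-shaped prompt.
image = np.ones((4, 4, 3))
region_mask = np.zeros((4, 4))
region_mask[:2, :2] = 1.0                  # region of interest: top-left 2x2
image_prompt = np.full((4, 4, 3), 0.5)     # stands in for a learned prompt
highlighted_image = highlight(image, region_mask, image_prompt)

# Text case: 4 token embeddings, a word mask, and one prompt vector.
tokens = np.arange(12, dtype=float).reshape(4, 3)
word_mask = np.array([0.0, 1.0, 0.0, 0.0])  # keep only the second token
text_prompt = np.zeros(3)                   # stands in for a learned prompt
highlighted_text = highlight(tokens, word_mask, text_prompt)
```

In a trainable version the two prompts would be parameters updated by gradient descent; fixed arrays are used here only to keep the sketch self-contained.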