
Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Ling Shao, Shijian Lu

TL;DR

This work addresses the cross-modal semantic gap in language-supervised semantic segmentation, where captions frequently miss many visual concepts present in images. It introduces Concept Curation (CoCu), a three-stage pipeline (vision-driven expansion, text-to-vision-guided ranking, and cluster-guided sampling) that enriches textual concepts using CLIP and cross-image retrieval, mitigating bias toward salient concepts. Integrated with the GroupViT framework, CoCu yields state-of-the-art zero-shot segmentation across eight benchmarks and accelerates training convergence by providing richer, more balanced concept supervision. The approach strengthens open-vocabulary segmentation by aligning visual and textual semantics during pre-training and suggests potential extensions to detection and instance segmentation tasks.

Abstract

Vision-Language Pre-training has demonstrated remarkable zero-shot recognition ability and the potential to learn generalizable visual representations from language supervision. Taking a step further, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state of the art suffers from clear semantic gaps between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. Such semantic misalignment propagates through pre-training, leading to inferior zero-shot performance in dense prediction because the textual representations capture too few visual concepts. To close this semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts using our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can then be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin, suggesting the value of bridging the semantic gap in pre-training data.

Paper Structure (12 sections, 8 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: The cross-modal semantic gap is prevalent in web-crawled image-text pairs. As in (a), the caption often captures only certain salient visual concepts in the paired image and misses many others (i.e., 'person', 'grass', and 'sky') that are also useful in image-text modeling. Leveraging CLIP (Radford et al., 2021), more useful visual concepts can be captured via image-to-text retrieval, but the retrieved captions usually suffer from semantic bias, as in (b) (i.e., 'person' is recovered but 'grass' and 'sky' are still missing). Our proposed Concept Curation (CoCu) bridges the cross-modal semantic gap effectively through vision-driven expansion, text-to-vision-guided ranking, and cluster-guided sampling while avoiding the negative effect of semantic bias, as illustrated in (c). Best viewed in color.
  • Figure 2: Illustration of vision-driven expansion (above) and text-to-vision-guided ranking (below) in CoCu. To compensate for missing semantics, vision-driven expansion establishes an archive of potentially matched concepts through image-to-image retrieval, while text-to-vision-guided ranking scores the retrieved concepts by their assigned relevancy. The textual concepts can later be identified in pre-training by sampling. In the figure, images with a blue border $\square$ are retrieved via expanded concepts (marked in blue) using their paired captions, while images with a red border $\square$ are the images being curated (the anchors). Best viewed in color.
  • Figure 3: CoCu enhances training convergence. (a) The training loss curves of GroupViT and CoCu demonstrate that CoCu significantly accelerates pre-training convergence. (b) CoCu achieves superior binary segmentation results (second row) compared to GroupViT (first row) for the concept of "grass," which is missing in the caption, using an example image captioned as "a red fox drinking water." Best viewed in color.
  • Figure 4: Visualization of activation heatmaps. GroupViT fails to activate on corresponding visual regions for concepts not represented in captions, while CoCu exhibits significantly better localization. High activation is shown as red, and low activation is displayed as blue. Best viewed in color.
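
The expansion and ranking steps described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the hand-made 3-D vectors stand in for CLIP image and text embeddings, the function names `expand` and `rank` and the parameters `k` and `m` are hypothetical, and the scoring rule (mean similarity between the anchor and the images retrieved for a concept) is only one plausible reading of the caption. Cluster-guided sampling is not sketched here.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def expand(anchor, pool_images, pool_concepts, k):
    """Vision-driven expansion (sketch): gather the caption concepts of the
    k pool images most similar to the anchor image into an archive."""
    sims = unit(pool_images) @ unit(anchor)
    archive = []
    for i in np.argsort(-sims)[:k]:
        for concept in pool_concepts[i]:
            if concept not in archive:
                archive.append(concept)
    return archive

def rank(anchor, archive, concept_text, pool_images, m):
    """Text-to-vision-guided ranking (sketch, assumed scoring rule): for each
    concept, retrieve the m pool images that best match its text embedding,
    then score the concept by their mean similarity to the anchor image."""
    pool = unit(pool_images)
    anchor_sims = pool @ unit(anchor)
    scored = []
    for concept in archive:
        text_sims = pool @ unit(concept_text[concept])
        top = np.argsort(-text_sims)[:m]
        scored.append((concept, float(anchor_sims[top].mean())))
    return sorted(scored, key=lambda item: -item[1])

# Toy embeddings (assumption): the axes loosely mean fox, grass, sky.
pool_images = np.array([
    [1.0, 0.6, 0.0],   # image captioned with fox and grass
    [0.0, 1.0, 0.5],   # image captioned with grass and sky
    [0.0, 0.1, 1.0],   # image captioned with sky
    [0.9, 0.0, 0.3],   # image captioned with fox
])
pool_concepts = [["fox", "grass"], ["grass", "sky"], ["sky"], ["fox"]]
concept_text = {"fox": [1, 0, 0], "grass": [0, 1, 0], "sky": [0, 0, 1]}

# Anchor image whose own caption only mentions the fox.
anchor = np.array([1.0, 0.8, 0.1])

archive = expand(anchor, pool_images, pool_concepts, k=3)
ranked = rank(anchor, archive, concept_text, pool_images, m=2)
for concept, score in ranked:
    print(f"{concept}: {score:.3f}")
```

In this toy data, 'grass' and 'sky' enter the archive even though the anchor's caption mentions only the fox, and 'sky' receives the lowest score because the images retrieved for it look least like the anchor.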