Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge Zhang, Jionglong Su, Zongyuan Ge

TL;DR

This work addresses the persistent modality gap between text and vision representations in CLIP-based weakly supervised semantic segmentation (WSSS). It introduces Vision Prototype Learning (VPL), a two-phase framework that learns class-specific vision prototypes in the vision space from text prototypes via a KL-divergence constraint, then refines pseudo-labels with a Regional Semantic Contrast (RSC) module that aligns region embeddings with the prototypes. The theoretical analysis indicates that the optimal vision prototypes reside in the vision space and that a text-space approximation cannot fully recover them. Empirical results on PASCAL VOC 2012 and MS COCO 2014 show state-of-the-art segmentation performance when VPL is integrated into CLIP-based methods. The approach yields more accurate localization, better object coverage, and improved robustness to the modality gap.

Abstract

The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research draws on CLIP's powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between the text and vision spaces, the text prototypes employed by these methods do not establish a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework that introduces more representative vision prototypes. The core of this framework is to learn class-specific vision prototypes in the vision space with the help of text prototypes, in order to capture high-quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state-of-the-art performance on two benchmark datasets.
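
The regional semantic contrast idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss, so this example assumes masked average pooling to form a region embedding and an InfoNCE-style loss over cosine similarities; the function names, the temperature value, and the toy data are all hypothetical.

```python
import math

def region_embedding(features, mask):
    """Masked average pooling: mean of the pixel features where mask is 1."""
    picked = [f for f, m in zip(features, mask) if m]
    if not picked:
        raise ValueError("mask selects no pixels")
    dim = len(features[0])
    return [sum(f[d] for f in picked) / len(picked) for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def region_contrast_loss(region, prototypes, cls, tau=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated formula):
    pull the region toward prototype `cls`, push it away from the others."""
    sims = [cosine(region, p) / tau for p in prototypes]
    m = max(sims)
    log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denom - sims[cls]

# Toy data: four pixel features, a mask covering the first two pixels,
# and two hypothetical class prototypes.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
mask = [1, 1, 0, 0]
prototypes = [[1.0, 0.0], [0.0, 1.0]]

region = region_embedding(features, mask)
loss_match = region_contrast_loss(region, prototypes, cls=0)
loss_mismatch = region_contrast_loss(region, prototypes, cls=1)
```

In this toy setup the loss is small when the region is contrasted against the prototype of its own class and large otherwise, which is the behaviour a region-to-prototype contrast is meant to encourage.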

Paper Structure

This paper contains 11 sections, 4 theorems, 14 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Let $p_{i, k}^n$ and $p_{i, k}^{n \prime}$ denote the predicted probabilities for each pixel $k$ obtained with the vision prototypes $W^*$ and the text prototypes $Z$, respectively. Then, we have: where $\tau_T$ represents the temperature in CLIP, while $\tau_I$ is the temperature for learning vision prototypes. If $\mathbf{z}_n^{x}=\mathbf{w}_n^{*}$, then $p_{i, k}^{n \prime}=p_{i, k}^n$ and $\tau_T$ …
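
To make the quantities in Proposition 1 concrete, here is a minimal sketch of per-pixel class probabilities computed from a set of prototypes. The proposition's equation is not reproduced in this summary, so the sketch assumes the usual CLIP-style form: a softmax over cosine similarities divided by a temperature. The function names, prototype values, and temperature values are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pixel_probs(feature, prototypes, tau):
    """Class probabilities for one pixel feature against N class prototypes
    (assumed form: softmax of cosine similarity / temperature)."""
    f = normalize(feature)
    logits = [sum(a * b for a, b in zip(f, normalize(p))) / tau
              for p in prototypes]
    return softmax(logits)

pixel = [0.9, 0.1, 0.3]                    # one pixel-level vision feature
Z = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # hypothetical text prototypes
W = [[0.8, 0.1, 0.4], [0.1, 0.9, 0.2]]     # hypothetical vision prototypes
tau_T = 0.1                                # illustrative temperature for Z
tau_I = 0.1                                # illustrative temperature for W

p_text = pixel_probs(pixel, Z, tau_T)      # plays the role of p'
p_vis = pixel_probs(pixel, W, tau_I)       # plays the role of p
p_same = pixel_probs(pixel, Z, tau_I)      # prototypes and temperature coincide
```

When the prototypes and temperatures coincide, the two probability vectors are identical; when the prototypes differ, as they do under a modality gap, the predictions differ as well.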

Figures (4)

  • Figure 1: The main idea of this paper is to reduce the impact of the modality gap by learning vision prototypes. We show the modality gap between paired image embeddings and text prototypes. Even though the optimized prompt minimizes the contrastive loss between prototypes and object regions, the modality gap is still not sufficiently reduced (as shown in (a)), so the text prototypes fail to accurately capture the relevant region of the target object. In contrast, the proposed VPL ensures accurate activation of the complete object region (b).
  • Figure 2: Overview of the proposed Weakly-supervised Vision Prototype Learning (VPL), which consists of two main components: (1) learning the vision prototypes $W$ and (2) regional semantic contrast (RSC). In phase (1), vision prototypes can be obtained efficiently by solving a convex optimization problem with gradient descent (Eq. (17)). This ensures that they align better with the vision data and mitigates the impact of the modality gap. The vision prototypes then replace the text prototypes to locate the target object and generate initial GradCAMs. In phase (2), we refine these GradCAMs to form pseudo-labels for supervising the decoder while the CLIP encoders are frozen. Then, we obtain the masks from the vision prototypes and align them with specific region embeddings.
  • Figure 3: Qualitative results on the Pascal VOC 2012 val set. (a) Input images. (b) Results from CLIP-ES. (c) Results from our CLIP-ES+VPL. (d) Ground truth. Our method produces more accurate responses and works as a plug-and-play method.
  • Figure 4: Feature embedding visualizations of (a) our framework without RSC, and (b) our full framework on the Pascal VOC 2012 val set using t-SNE (van der Maaten and Hinton, 2008).
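
Phase (1) in the Figure 2 caption, learning vision prototypes by gradient descent on a convex problem, can be sketched as follows. The exact objective is not given in this summary, so the sketch assumes a cross-entropy term on class labels plus a KL-divergence term that keeps the predictions of the vision prototypes close to those of the fixed text prototypes. It is simplified to one label per feature, and all names, hyperparameters, and data are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [v / s for v in e]

def logits(x, protos, tau):
    # One logit per class: dot(x, prototype) / tau (linear in the prototypes)
    return [sum(a * b for a, b in zip(x, p)) / tau for p in protos]

def objective(X, Y, W, Z, tau_i, tau_t, lam):
    """Assumed objective: mean of CE(y, p) + lam * KL(q || p)."""
    total = 0.0
    for x, y in zip(X, Y):
        p = softmax(logits(x, W, tau_i))   # vision-prototype prediction
        q = softmax(logits(x, Z, tau_t))   # text-prototype prediction (fixed)
        ce = -math.log(p[y])
        kl = sum(qn * math.log(qn / pn) for qn, pn in zip(q, p))
        total += ce + lam * kl
    return total / len(X)

def learn_vision_prototypes(X, Y, Z, tau_i, tau_t, lam, lr=0.1, steps=200):
    W = [list(z) for z in Z]               # initialise from the text prototypes
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for x, y in zip(X, Y):
            p = softmax(logits(x, W, tau_i))
            q = softmax(logits(x, Z, tau_t))
            for n in range(len(W)):
                # d CE / d logit_n = p_n - 1[n == y]
                # d KL(q || p) / d logit_n = p_n - q_n
                target = 1.0 if n == y else 0.0
                g = ((p[n] - target) + lam * (p[n] - q[n])) / tau_i
                for d in range(len(x)):
                    grad[n][d] += g * x[d] / len(X)
        for n in range(len(W)):
            for d in range(len(W[n])):
                W[n][d] -= lr * grad[n][d]
    return W

# Toy vision features with labels, and poorly separated text prototypes.
X = [[1.0, 0.2], [0.9, 0.4], [0.3, 1.0], [0.2, 0.8]]
Y = [0, 0, 1, 1]
Z = [[1.0, 0.8], [0.8, 1.0]]
TAU_I, TAU_T, LAM = 0.5, 0.5, 0.1

loss_before = objective(X, Y, Z, Z, TAU_I, TAU_T, LAM)
W = learn_vision_prototypes(X, Y, Z, TAU_I, TAU_T, LAM)
loss_after = objective(X, Y, W, Z, TAU_I, TAU_T, LAM)
```

Because the logits are linear in the prototypes, both terms of this assumed objective are convex in $W$, which is why plain gradient descent with a small step size is enough in the sketch. Starting from the text prototypes means the KL term is zero at initialisation and grows only as far as the data term justifies.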

Theorems & Definitions (4)

  • Proposition 1
  • Theorem 1: The lower-bound of modality gap
  • Theorem 2
  • Theorem 3: Learning Vision Prototypes