InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

Muyao Yuan, Yuanhong Zhang, Weizhan Zhang, Lan Ma, Yuan Gao, Jiangyong Ying, Yudeng Xin

TL;DR

InfoCLIP tackles open-vocabulary semantic segmentation by preserving pretrained CLIP’s vision–language alignment while adapting the model to pixel-level prediction. It introduces a Learnable Pixel-Text Alignment Module (LPAM) to extract fine-grained patch–text relations, an information bottleneck loss to denoise the pretrained alignment, and a mutual-information transfer loss to preserve that alignment during asymmetric fine-tuning. Both losses are built on matrix-based Rényi entropy computed from Gram matrices; together they regularize the pretrained knowledge and transfer it to the fine-tuned model. The method reports state-of-the-art results under COCO-Stuff-based evaluation on the ADE20K, PASCAL VOC, and PASCAL-Context benchmarks while keeping training efficient. The approach indicates that information-driven alignment transfer can stabilize CLIP fine-tuning for dense prediction tasks, and it offers a general framework for preserving cross-modal structure in vision-language models.

Abstract

Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.
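The TL;DR above says both objectives rest on matrix-based Rényi entropy and Gram matrices. As a rough illustration of that general estimator (not the authors' implementation), the NumPy sketch below computes entropy and mutual information from trace-normalized Gram matrices. The Gaussian kernel, the bandwidth `sigma`, the order `alpha=2`, and the names `teacher_align` and `student_align` are all assumptions made for this example; the paper's actual kernel, order, and inputs may differ.

```python
import numpy as np


def gram_matrix(x, sigma=1.0):
    """Trace-normalized Gaussian-kernel Gram matrix of the rows of x, shape (n, d).

    The Gaussian kernel and sigma are illustrative assumptions.
    """
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)


def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy: S_alpha(A) = log2(sum_i lambda_i^alpha) / (1 - alpha).

    `a` must be symmetric PSD with unit trace; alpha must differ from 1.
    """
    eig = np.clip(np.linalg.eigvalsh(a), 0.0, None)  # clip tiny negative round-off
    return np.log2(np.sum(eig ** alpha)) / (1.0 - alpha)


def joint_entropy(a, b, alpha=2.0):
    """Joint entropy from the trace-normalized Hadamard product of a and b."""
    h = a * b
    return renyi_entropy(h / np.trace(h), alpha)


def mutual_information(a, b, alpha=2.0):
    """I_alpha(A; B) = S_alpha(A) + S_alpha(B) - S_alpha(A, B)."""
    return renyi_entropy(a, alpha) + renyi_entropy(b, alpha) - joint_entropy(a, b, alpha)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical patch-text alignment maps: 16 patches x 8 text classes.
    teacher_align = rng.normal(size=(16, 8))
    student_align = teacher_align + 0.1 * rng.normal(size=(16, 8))
    mi = mutual_information(gram_matrix(teacher_align), gram_matrix(student_align))
    print(f"estimated mutual information: {mi:.4f} bits")
```

In a transfer setting, a quantity like `mutual_information` would be maximized between teacher and student alignments, while an entropy-style term would penalize uncompressed alignment. How InfoCLIP combines these terms is defined in the paper's own equations, which are not reproduced here.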

Paper Structure

This paper contains 27 sections, 15 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Motivation of InfoCLIP. To leverage the valuable yet noisy pixel-text modality alignment from the pretrained CLIP for enhancing OVSS, we: (1) denoise the pretrained alignment through compression to extract semantic-aware alignment; and (2) transfer more generalized semantic alignment via distillation to alleviate the narrowing of the modality alignment space.
  • Figure 2: Overview of InfoCLIP. To exploit the valuable yet noisy pixel-text alignment from a pretrained foundation model (CLIP) for OVSS, InfoCLIP introduces an information-theoretic framework for asymmetric adaptation, comprising: (1) a Learnable Pixel-Text Alignment Module (LPAM) to extract fine-grained patch-text relations; (2) an information bottleneck loss to suppress noise and retain semantic-aware alignment; and (3) a mutual information transfer loss to preserve modality alignment by bridging pretrained and fine-tuned CLIP representations. The detailed formulation of LPAM is provided in the Appendix.
  • Figure 3: Effectiveness of alignment distillation. We present the t-SNE visualization of CLIP image embeddings. As highlighted in the red boxes, while a state-of-the-art method confuses the features of the seen class chair and the unseen class armchair, our method differentiates them and alleviates overfitting to seen classes, benefiting from the pretrained knowledge distilled from the teacher CLIP model.
  • Figure 4: Effectiveness of semantic alignment extraction and compression. Semantic compression denoises the pixel-text alignments extracted from the pretrained model, resulting in a sharper focus on the semantic center. From left to right: examples corresponding to the classes car, boat, building-other, bus, and person.
  • Figure 5: Hyperparameter sensitivity analysis of $\lambda_1$ and $\lambda_2$ balancing $\mathcal{L}_c$ and $\mathcal{L}_d$ on the A-150 and PC-59 datasets.
  • ...and 5 more figures