CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization

Yingrui Ji, Xi Xiao, Gaofei Chen, Hao Xu, Chenrui Ma, Lijing Zhu, Aokun Liang, Jiansheng Chen

TL;DR

This work explains CLIP's strong cross-modal generalization through the Information Bottleneck (IB) lens by introducing the Cross-modal Information Bottleneck (CIB) framework. It establishes a theoretical link between CLIP's contrastive objective and cross-modal IB, then proposes Cross-modal Information Bottleneck Regularization (CIBR), a regularizer that explicitly penalizes modality-specific redundancy, with the relevant mutual information estimated by MINE (Mutual Information Neural Estimation). Experiments on seven zero-shot classification datasets and two retrieval benchmarks (MSCOCO and Flickr30K) show consistent improvements over CLIP and prompt-based baselines, along with faster training dynamics and better interpretability. Overall, the paper provides both a theoretical framework and a practical regularization method for improving cross-modal representation learning and generalization.
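For orientation, the two standard results that this kind of argument usually builds on can be written down as follows. This is background only, not the paper's own CIB objective, which is not reproduced in this summary. The symbols $X$, $Y$, $Z$, $\beta$, $N$ and $\mathcal{L}_{\mathrm{InfoNCE}}$ are assumptions introduced here for illustration; $Z_v$ and $Z_t$ are the image and text embeddings from the figure captions.

```latex
% Classical Information Bottleneck: compress the input X into Z
% while keeping the information Z carries about the target Y.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y), \qquad \beta > 0

% InfoNCE bound: with a batch of N pairs, minimizing the contrastive
% loss raises a lower bound on the cross-modal mutual information.
I(Z_v; Z_t) \;\ge\; \log N \;-\; \mathcal{L}_{\mathrm{InfoNCE}}
```

Read together, the second line suggests why a contrastive objective can be viewed as maximizing the shared term $I(Z_v; Z_t)$, while the first line shows where an explicit compression penalty would enter.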

Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP's strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP's contrastive learning objective as an implicit Information Bottleneck optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on extensive vision-language benchmarks, including zero-shot classification across seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP's generalization through the IB lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.
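The ingredients named in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the fixed critic, the temperature of 0.07, and the weight `lam=0.1` are all illustrative choices, and the exact quantity that CIBR penalizes is not specified in this summary, so the penalty is passed in as a generic scalar estimate. The MI estimator shown is the Donsker-Varadhan lower bound that MINE optimizes; in MINE the critic is a trained network, whereas here it is any callable.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(z_v, z_t, temperature=0.07):
    """Symmetric InfoNCE loss; row i of z_v is paired with row i of z_t."""
    z_v, z_t = l2_normalize(z_v), l2_normalize(z_t)
    logits = z_v @ z_t.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def dv_lower_bound(critic, x, y, rng):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginals[exp(T)].

    `critic(x, y)` must return one score per row. Shuffling y breaks the
    pairing and approximates samples from the product of marginals.
    """
    joint_scores = critic(x, y)
    marginal_scores = critic(x, y[rng.permutation(len(y))])
    m = marginal_scores.max()  # subtract the max for a stable log-mean-exp
    log_mean_exp = m + np.log(np.mean(np.exp(marginal_scores - m)))
    return joint_scores.mean() - log_mean_exp


def regularized_loss(z_v, z_t, redundancy_estimate, lam=0.1):
    """Contrastive loss plus a weighted redundancy penalty (assumed form)."""
    return clip_contrastive_loss(z_v, z_t) + lam * redundancy_estimate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_v = rng.normal(size=(8, 16))
    z_t = z_v + 0.1 * rng.normal(size=(8, 16))  # noisy copies as "captions"

    def dot_critic(a, b):
        # Hypothetical fixed critic; MINE would train a network instead.
        return (a * b).sum(axis=1)

    penalty = dv_lower_bound(dot_critic, z_v, z_t, rng)
    print("contrastive:", clip_contrastive_loss(z_v, z_t))
    print("regularized:", regularized_loss(z_v, z_t, penalty))
```

Keeping the penalty as a separate scalar makes the role of the regularization coefficient explicit: setting `lam=0` recovers the plain contrastive loss, which is the natural baseline for an ablation over that coefficient.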

Paper Structure

This paper contains 19 sections, 7 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Illustration of our Cross-modal Information Bottleneck Regularization (CIBR) strategy. Given image-text pairs, the visual and textual encoders produce embeddings $Z_v$ and $Z_t$, which are aligned through contrastive learning.
  • Figure 2: Comparison between the original CLIP architecture and our proposed CIBR framework. The left diagram shows the standard CLIP structure that aligns image-text pairs via contrastive loss. The right diagram illustrates our enhanced pipeline, where an information bottleneck-inspired regularization module explicitly suppresses modality-specific redundancy and improves cross-modal semantic alignment by maximizing the mutual information $I(Z_v; Z_t)$ while minimizing conditional redundancy.
  • Figure 3: t-SNE visualization of learned feature embeddings on the CIFAR-100 dataset.
  • Figure 4: Text-to-image retrieval performance (Recall@1) during training on MSCOCO and Flickr30K.
  • Figure 5: Ablation study on the regularization coefficient $\lambda$ in the CIBR loss.
  • ...and 4 more figures