Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift

Yihao Xue, Siddharth Joshi, Dang Nguyen, Baharan Mirzasoleiman

TL;DR

The paper addresses why multimodal contrastive learning (MMCL) yields strong zero-shot robustness under distribution shift and introduces a theoretical framework comparing MMCL with supervised learning (SL). It identifies two key mechanisms—intra-class contrasting and inter-class feature sharing—enabled by rich captions, and derives bounds showing MMCL can significantly outperform SL on out-of-distribution data. The authors validate these claims with synthetic analyses and experiments on MSCOCO, Conceptual Captions, and shifted ImageNet, demonstrating that caption richness is crucial and that MMCL’s robustness stems from its cross-modal signaling and loss structure. The work highlights practical implications for loss design and data curation to enhance OOD resilience in multimodal systems like CLIP.

Abstract

Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite this empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: intra-class contrasting, which allows the model to learn features with a high variance, and inter-class feature sharing, where annotated details in one class help in learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data from overshadowing the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP models on MSCOCO/Conceptual Captions and evaluating them on shifted versions of ImageNet.

Paper Structure (45 sections, 28 theorems, 97 equations, 7 figures)


Key Result

Lemma 3.2

Given an image with feature $\pmb{\mu}'$ and a text with feature $\pmb{\mu}''$, the similarity (inner product of representations) between them, computed using encoders trained on the training set, is: $\textbf{similarity score} \approx \pmb{\mu}'^\top \pmb{C}^{Tr} \pmb{\mu}'' = \sum_{i=1}^l\sum_{j=1}^
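The bilinear form in the lemma, $\pmb{\mu}'^\top \pmb{C}^{Tr} \pmb{\mu}''$, can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the matrix `C` is a hypothetical stand-in for $\pmb{C}^{Tr}$ (whose definition is not reproduced in this excerpt), the feature vectors are toy values, and the helper names `similarity_score` and `zero_shot_predict` are not from the paper. Zero-shot classification is shown in the usual CLIP style, by picking the class prompt with the highest score.

```python
import numpy as np


def similarity_score(mu_img, C, mu_txt):
    """Bilinear score mu_img^T C mu_txt between an image and a text feature."""
    return float(mu_img @ C @ mu_txt)


def zero_shot_predict(mu_img, C, class_prompts):
    """Return the index of the best-scoring class prompt and all scores."""
    scores = np.array([similarity_score(mu_img, C, p) for p in class_prompts])
    return int(np.argmax(scores)), scores


# Toy setup with 3 features (all values are illustrative assumptions).
# A diagonal C weights each feature separately; a zero entry means that
# feature contributes nothing to the score.
C = np.diag([1.0, 0.5, 0.0])
mu_img = np.array([1.0, -1.0, 2.0])

# One hypothetical text feature per class.
class_prompts = np.array([
    [1.0, 0.0, 0.0],   # class 0
    [-1.0, 0.0, 0.0],  # class 1
])

pred, scores = zero_shot_predict(mu_img, C, class_prompts)
print(pred, scores)
```

In this toy setup the third image feature is large but receives zero weight in `C`, so it has no influence on the prediction. This shows how the entries of the matrix determine which features drive the similarity score.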

Figures (7)

  • Figure 1: Construction of captions.
  • Figure 2: OOD accuracy on the semi-synthetic data. A large $\pi_{\text{core}}$ is crucial for ensuring MMCL's superior robustness compared to SL, but the value of $\pi_{\text{spu}}$ has minimal effect.
  • Figure 3: (a) MMCL is more robust than SL. (b) Caption richness and (c) intra-class contrasting contribute to robustness. Note that (c) uses a different setup than (a) and (b), as detailed in the appendices for the MSCOCO and Conceptual Captions experiments.
  • Figure 4: Construction of images.
  • Figure 5: In-distribution test accuracy evaluated on a dataset constructed in the same way as the training data but with images from the MNIST test set.
  • ...and 2 more figures

Theorems & Definitions (46)

  • Definition 3.1
  • Lemma 3.2: Informal
  • Definition 4.1: Data Model 1
  • Theorem 4.3: Theorem 1 from Sagawa et al. (2020)
  • Theorem 4.4
  • Corollary 4.5
  • Definition 4.6: Data Model 2
  • Theorem 4.7
  • Theorem 4.8
  • Definition 5.1: Feature masking in data model 1 (Definition 4.1)
  • ...and 36 more