
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Yuhui Zhang, Elaine Sui, Serena Yeung-Levy

TL;DR

The paper analyzes the geometry of multi-modal contrastive representation spaces and identifies two obstacles to using embeddings from different modalities interchangeably: a persistent modality gap and alignment noise. Its theoretical model is ${\mathbf e}_x - {\mathbf e}_y = {\mathbf c}_\perp + {\boldsymbol \epsilon}$, where ${\mathbf c}_\perp$ is a constant gap orthogonal to the modality embedding spans and ${\boldsymbol \epsilon}$ is Gaussian-like alignment noise. The analysis shows that standard contrastive optimization fails to close this gap. Building on this, the paper proposes a simple three-step method, $C^3$ (Connect, Collapse, Corrupt), that aligns embeddings so cross-modal tasks can be learned from uni-modal data. Empirically, $C^3$ achieves state-of-the-art zero-shot results on image, audio, and video captioning and on text-to-image generation, performs strongly in low-data regimes, and generalizes to other modalities and embedding spaces. The work offers a principled direction for data-efficient cross-modal learning and insight into the geometry of contrastive representation spaces.

Abstract

Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.
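The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that Collapse means subtracting each modality's mean embedding and that Corrupt means adding Gaussian noise during training. The synthetic embeddings, the `gap` vector, and the helper names `collapse` and `corrupt` are hypothetical, and re-normalization after collapsing is omitted for brevity.

```python
import numpy as np


def collapse(embeddings, modality_mean):
    """Collapse: subtract a modality's mean embedding to remove a constant gap."""
    return embeddings - modality_mean


def corrupt(embeddings, noise_std, rng):
    """Corrupt: add Gaussian noise so a decoder tolerates alignment noise."""
    return embeddings + rng.normal(0.0, noise_std, size=embeddings.shape)


rng = np.random.default_rng(0)
n, dim = 512, 16

# Connect: stand-ins for embeddings from a pre-trained contrastive model.
# Synthetic data; a real pipeline would call the image and text encoders here.
shared = rng.normal(size=(n, dim))
gap = np.zeros(dim)
gap[0] = 3.0  # hypothetical constant modality gap
text_emb = shared
image_emb = shared + gap + 0.1 * rng.normal(size=(n, dim))

text_mean = text_emb.mean(axis=0)
image_mean = image_emb.mean(axis=0)
gap_before = np.linalg.norm(image_mean - text_mean)

text_collapsed = collapse(text_emb, text_mean)
image_collapsed = collapse(image_emb, image_mean)
gap_after = np.linalg.norm(
    image_collapsed.mean(axis=0) - text_collapsed.mean(axis=0)
)

# Training inputs: collapsed, corrupted text embeddings (uni-modal data only).
train_inputs = corrupt(text_collapsed, noise_std=0.1, rng=rng)
# Inference inputs: collapsed image embeddings fed to the same decoder.
test_inputs = image_collapsed

print(f"gap before collapse: {gap_before:.3f}, after: {gap_after:.2e}")
```

In this toy setting the mean difference between modalities vanishes after collapsing, and the only remaining mismatch is the per-sample noise that the corruption step is meant to make a downstream decoder robust to.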

Paper Structure

This paper contains 55 sections, 5 theorems, 15 equations, 14 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

(Multi-modal Contrastive Representation Space Geometry) Given a paired image $x$ and text $y$, the relationship between the $\ell_2$-normalized image embedding ${\mathbf e}_x$ and text embedding ${\mathbf e}_y$ obtained from multi-modal contrastive learning can be described as $${\mathbf e}_x - {\mathbf e}_y = {\mathbf c}_\perp + {\boldsymbol \epsilon},$$ where ${\boldsymbol \epsilon}$ is the alignment noise and ${\mathbf c}_\perp$ is a constant vector representing the modality gap and is orthogonal to the image and text embedding span, i.e., ...
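The relation ${\mathbf e}_x - {\mathbf e}_y = {\mathbf c}_\perp + {\boldsymbol \epsilon}$ can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's experiment: embeddings are confined to the first `k` coordinates, the gap lives in the remaining coordinates, the noise is Gaussian inside the span, and $\ell_2$ normalization is skipped. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, k = 1000, 32, 8  # k = hypothetical number of effective dimensions

# Text embeddings occupy only the first k coordinates (the embedding span).
text = np.zeros((n, dim))
text[:, :k] = rng.normal(size=(n, k))

# Constant gap living entirely in the remaining coordinates.
c_perp = np.zeros(dim)
c_perp[k:] = rng.normal(size=dim - k)

# Alignment noise, assumed here to be Gaussian inside the span.
eps = np.zeros((n, dim))
eps[:, :k] = 0.05 * rng.normal(size=(n, k))

image = text + c_perp + eps

# Estimate the gap as the mean paired difference; the residual is the noise.
c_hat = (image - text).mean(axis=0)
residual = image - text - c_hat

# Orthogonality check: cosine between c_hat and differences of text embeddings.
diffs = text[1:] - text[:-1]
cosines = diffs @ c_hat / (
    np.linalg.norm(diffs, axis=1) * np.linalg.norm(c_hat)
)
max_abs_cosine = np.abs(cosines).max()
noise_std_hat = residual[:, :k].std()

print(f"max |cos(gap, in-span direction)|: {max_abs_cosine:.4f}")
print(f"estimated noise std inside the span: {noise_std_hat:.4f}")
```

The mean paired difference recovers the constant gap, and the estimated gap is nearly orthogonal to every direction inside the embedding span, which is the structure the proposition describes.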

Figures (14)

  • Figure 1: Interchangeable use of embeddings enables learning cross-modal tasks with uni-modal data.
  • Figure 2: Geometry of the multi-modal contrastive representation space.
  • Figure 3: Dimensional collapse of the CLIP representation space. Singular values obtained from SVD reveal that the effective dimension of the image and text representation space is much smaller than the total number of dimensions.
  • Figure 4: Variance of each dimension before (left) and after (right) multi-modal contrastive optimization. Our analysis reveals that gradients will only be propagated to effective dimensions and no gradient will be propagated to ineffective dimensions. Therefore, the effective dimensions are aligned while ineffective dimensions remain constant after optimization.
  • Figure 5: Stable region (green area) of contrastive learning controlled by temperature. Within the stable region, the loss falls below a small preset value, indicating that optimization has ended. The region increases as the temperature decreases.
  • ...and 9 more figures
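The dimensional collapse described in the Figure 3 caption can be measured with a singular value decomposition. The following is a minimal sketch on synthetic data, assuming that the effective dimension is the number of singular values above a relative threshold. The function name `effective_dimension`, the threshold `rel_tol`, and the low-rank test matrix are illustrative choices and are not taken from the paper.

```python
import numpy as np


def effective_dimension(embeddings, rel_tol=1e-6):
    """Count singular values above rel_tol times the largest singular value."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    effective = int((singular_values > rel_tol * singular_values[0]).sum())
    return effective, singular_values


rng = np.random.default_rng(0)
n, dim, k = 500, 64, 8

# Synthetic embeddings: k latent factors projected into a dim-dimensional space,
# so the matrix has rank k even though each vector has dim coordinates.
latent = rng.normal(size=(n, k))
projection = rng.normal(size=(k, dim))
embeddings = latent @ projection

effective, singular_values = effective_dimension(embeddings)
print(f"total dimensions: {dim}, effective dimensions: {effective}")
```

With real embeddings the spectrum usually decays gradually instead of dropping to zero, so the threshold would need to be chosen by inspecting the singular value plot.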

Theorems & Definitions (8)

  • Proposition 1
  • Definition 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Proof of Lemma \ref{prop:gradient}
  • Lemma 4
  • Proof of Lemma \ref{prop:region}