Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
TL;DR
The paper analyzes the geometry of multi-modal contrastive spaces and identifies two obstacles to cross-modal interchangeability: a persistent modality gap and alignment noise. Its theoretical model is ${\mathbf e}_x - {\mathbf e}_y = {\mathbf c}_\perp + {\boldsymbol \epsilon}$, where ${\mathbf c}_\perp$ is a constant gap orthogonal to the modality spans and ${\boldsymbol \epsilon}$ is Gaussian-like alignment noise, and it shows that standard contrastive optimization fails to close the gap. Building on this, it proposes a simple three-step method, $C^3$ (Connect, Collapse, Corrupt), to align embeddings and enable cross-modal tasks to be learned from uni-modal data. Empirically, $C^3$ achieves state-of-the-art zero-shot results on image/audio/video captioning and text-to-image generation, including strong performance in low-data regimes and successful generalization to other modalities and embedding spaces. The work offers a principled direction for data-efficient cross-modal learning and provides insights into the geometry of contrastive representation spaces.
Abstract
Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.
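To make the geometric model ${\mathbf e}_x - {\mathbf e}_y = {\mathbf c}_\perp + {\boldsymbol \epsilon}$ concrete, the following is a minimal NumPy sketch on synthetic data, not the authors' implementation. It assumes that "Collapse" can be read as subtracting each modality's mean embedding (removing the constant gap) and that "Corrupt" can be read as adding Gaussian noise to training embeddings (mimicking alignment noise). The function names `collapse` and `corrupt`, the noise scale, and the toy embeddings standing in for a pre-trained contrastive encoder ("Connect") are all illustrative assumptions, and the toy gap is not constrained to be orthogonal to the modality spans.

```python
import numpy as np


def collapse(emb, modality_mean):
    """Assumed 'Collapse' step: subtract the modality's mean embedding."""
    return emb - modality_mean


def corrupt(emb, noise_std, rng):
    """Assumed 'Corrupt' step: add isotropic Gaussian noise to embeddings."""
    return emb + rng.normal(0.0, noise_std, size=emb.shape)


# Synthetic stand-in for 'Connect': in practice these embeddings would come
# from a pre-trained multi-modal contrastive encoder.
rng = np.random.default_rng(0)
n, d = 512, 64
text_emb = rng.normal(size=(n, d))                 # e_y
gap = np.full(d, 0.5)                              # constant gap (toy c)
align_noise = 0.1 * rng.normal(size=(n, d))        # alignment noise (toy epsilon)
image_emb = text_emb + gap + align_noise           # e_x = e_y + c + epsilon

# Distance between modality means before collapsing.
gap_before = np.linalg.norm(image_emb.mean(axis=0) - text_emb.mean(axis=0))

# Collapse: center each modality with its own mean.
text_c = collapse(text_emb, text_emb.mean(axis=0))
image_c = collapse(image_emb, image_emb.mean(axis=0))
gap_after = np.linalg.norm(image_c.mean(axis=0) - text_c.mean(axis=0))

# Corrupt: noisy text embeddings that a decoder could be trained on, so that
# it tolerates the residual alignment noise when fed image embeddings later.
train_inputs = corrupt(text_c, noise_std=0.1, rng=rng)

print(f"mean gap before collapse: {gap_before:.3f}")
print(f"mean gap after collapse:  {gap_after:.2e}")
```

In this toy setting, centering removes the constant offset between the two modality means, while the per-pair residual `image_c - text_c` remains as noise; the injected noise in `train_inputs` is meant to cover that residual.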
