Towards Uniformity and Alignment for Multimodal Representation Learning

Wenzhe Yin, Pan Zhou, Zehao Xiao, Jie Liu, Shujian Yu, Jan-Jakob Sonke, Efstratios Gavves

TL;DR

This work analyzes intrinsic conflicts in multimodal InfoNCE objectives as the number of modalities grows, identifying alignment–uniformity and intra-alignment conflicts that create cross-modal distribution gaps. It proposes UniAlign, a principled decoupling of intra-modality uniformity and anchor-based cross-modal alignment, with optional tuple-level extensions, and grounds the approach in a global Hölder divergence that it effectively approximates via KDE-based surrogates. The method yields consistent improvements in cross-modal retrieval and UnCLIP-style generation across datasets and decoders, reducing modality gaps without task-specific modules. Overall, UniAlign provides a scalable, theoretically justified framework for jointly enabling discriminative and generative multimodal learning beyond pairwise InfoNCE.

Abstract

Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
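To make the idea of decoupling concrete, the following is a minimal sketch and not the authors' UniAlign implementation. It assumes the widely used alignment loss (mean squared distance between paired embeddings) and Gaussian-potential uniformity loss from the contrastive-learning literature, applies uniformity within each modality separately, and applies alignment only between a chosen anchor modality and every other modality. The function names, the weight `lam`, the temperature `t`, and the modality names in the usage note are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Project embeddings onto the unit hypersphere."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def alignment_loss(z_anchor, z_other):
    """Mean squared distance between paired embeddings of two modalities."""
    return float(np.mean(np.sum((z_anchor - z_other) ** 2, axis=-1)))


def uniformity_loss(z, t=2.0):
    """Gaussian-potential uniformity computed inside ONE modality.

    log of the mean of exp(-t * ||z_i - z_j||^2) over all pairs i < j.
    The value is 0 when all points collapse and becomes more negative
    as the points spread out.
    """
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(z.shape[0], k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))


def decoupled_loss(embeddings, anchor, lam=1.0, t=2.0):
    """Anchor-based alignment plus per-modality uniformity (illustrative).

    embeddings: dict mapping a modality name to an (N, d) array whose rows
                are paired across modalities (row i describes sample i).
    anchor:     name of the modality every other modality is aligned to.
    """
    align = sum(
        alignment_loss(embeddings[anchor], z)
        for name, z in embeddings.items()
        if name != anchor
    )
    unif = sum(uniformity_loss(z, t) for z in embeddings.values())
    return align + lam * unif
```

For example, with `embeddings = {"image": zi, "text": zt, "audio": za}` and `anchor="image"`, the alignment term only involves the image-text and image-audio pairs, while the repulsive uniformity term never acts across modalities. Whether this matches the paper's exact objective is not stated in this summary.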

Paper Structure

This paper contains 26 sections, 4 theorems, 38 equations, 9 figures, 6 tables.

Key Result

Proposition 2.2

Let $\bm{\Phi}_a=\sum_{n\neq a}\bm{\Phi}_a^{(n)}$ be the total uniformity force on anchor $a$, and define $\zeta_a = \cos(\mathbf{V}_a,\bm{\Phi}_a)$. Under the paper's conflict assumption (source label assump:sys-conflict), the alignment–uniformity conflict converges to its maximum as the number of modalities $M$ increases.
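As a purely numerical illustration of the quantities named in the proposition, the sketch below sums per-modality force vectors into $\bm{\Phi}_a$ and evaluates the cosine $\zeta_a$ against a given $\mathbf{V}_a$. How the paper derives $\mathbf{V}_a$ and each $\bm{\Phi}_a^{(n)}$ from the loss is not given in this summary, so the inputs here are hypothetical vectors, and the function name `conflict_cosine` is an assumption.

```python
import numpy as np


def conflict_cosine(v_a, phi_parts, eps=1e-12):
    """Compute zeta_a = cos(V_a, Phi_a), where Phi_a is the sum of phi_parts.

    v_a:       (d,) vector standing in for V_a.
    phi_parts: (M-1, d) array, one row per non-anchor modality n != a.
    """
    v_a = np.asarray(v_a, dtype=float)
    phi_a = np.sum(np.asarray(phi_parts, dtype=float), axis=0)  # total force
    denom = max(np.linalg.norm(v_a) * np.linalg.norm(phi_a), eps)
    return float(v_a @ phi_a / denom)


# Hypothetical 2-D example: the two force components cancel sideways and
# add up to a vector pointing exactly opposite to v_a.
zeta = conflict_cosine([1.0, 0.0], [[-1.0, 0.5], [-1.0, -0.5]])
```

A cosine near $-1$ means the two vectors point in opposite directions, and a cosine near $0$ means they are close to orthogonal. Which value the paper treats as the maximum conflict is not recoverable from this summary.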

Figures (9)

  • Figure 1: Two conflicts of multimodal InfoNCE. (a) Alignment–uniformity: positives are pulled together yet repelled by the uniformity force; (b) Intra-alignment: non-collinear positives induce angular tension. Both grow with $M$.
  • Figure 2: ImageBind (paired InfoNCE).
  • Figure 3: GRAM (volume InfoNCE).
  • Figure 4: Ours
  • Figure 6: Modality-interpolation generation results, (T+A) $\to$ I. When interpolating between text and audio representations, our method fuses semantic information across modalities more effectively, leading to better generation.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Proposition 2.2: Alignment–Uniformity Conflict
  • Proposition 2.3: Intra-alignment Conflict
  • Proposition 1: Alignment–Uniformity Conflict
  • Proof
  • Proposition 2: Intra-alignment Conflict
  • Proof