Towards Uniformity and Alignment for Multimodal Representation Learning

Wenzhe Yin; Pan Zhou; Zehao Xiao; Jie Liu; Shujian Yu; Jan-Jakob Sonke; Efstratios Gavves

TL;DR

This work analyzes intrinsic conflicts in multimodal InfoNCE objectives as the number of modalities grows, identifying alignment–uniformity and intra-alignment conflicts that create cross-modal distribution gaps. It proposes UniAlign, a principled decoupling of intra-modality uniformity and anchor-based cross-modal alignment, with optional tuple-level extensions, and grounds the approach in a global Hölder divergence that it effectively approximates via KDE-based surrogates. The method yields consistent improvements in cross-modal retrieval and UnCLIP-style generation across datasets and decoders, reducing modality gaps without task-specific modules. Overall, UniAlign provides a scalable, theoretically justified framework for jointly enabling discriminative and generative multimodal learning beyond pairwise InfoNCE.

Abstract

Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
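The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the paper's UniAlign objective. It assumes the widely used alignment and uniformity losses of Wang and Isola (mean squared distance between paired embeddings, and the log of the mean Gaussian potential over distinct pairs), applies uniformity only within each modality, and aligns every non-anchor modality to a single anchor. The function names, the anchor choice, the temperature `t`, and the weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    # Project each row onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def uniformity_loss(z, t=2.0):
    # Log of the mean Gaussian potential over distinct pairs within ONE modality.
    n = z.shape[0]
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(n, k=1)  # each unordered pair once, no self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

def alignment_loss(z_anchor, z_other):
    # Mean squared distance between row-paired embeddings of two modalities.
    return float(np.mean(np.sum((z_anchor - z_other) ** 2, axis=1)))

def decoupled_loss(embeddings, anchor, t=2.0, lam=1.0):
    # embeddings: dict mapping modality name -> (N, d) unit-norm array,
    # where row i of every array belongs to the same sample.
    align = sum(
        alignment_loss(embeddings[anchor], z)
        for name, z in embeddings.items()
        if name != anchor
    )
    unif = sum(uniformity_loss(z, t) for z in embeddings.values())
    return align + lam * unif

# Hypothetical usage with random stand-in embeddings for three modalities.
rng = np.random.default_rng(0)
embeddings = {
    name: l2_normalize(rng.normal(size=(16, 8)))
    for name in ("image", "text", "audio")
}
print(decoupled_loss(embeddings, anchor="image"))
```

In this sketch the repulsive term never acts across modalities, so it cannot push a positive cross-modal pair apart, and each non-anchor modality is pulled toward one target only. That is the intuition behind the two conflicts named in the abstract, though the paper's actual formulation may differ.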

Paper Structure (26 sections, 4 theorems, 38 equations, 9 figures, 6 tables)

This paper contains 26 sections, 4 theorems, 38 equations, 9 figures, 6 tables.

Introduction
Multimodal Conflict Analysis
Uniformity and Alignment Conflict of InfoNCE
Systematic Multimodal Conflict Analysis
Methodology
General Principle for Multimodal Learning
Theoretical Analysis from Divergence Perspective
Tuple-Level Extensions
Related Work
Experiments
Cross-modal Retrieval
Cross-modal Generation
Conclusion
Proof of the Alignment–Uniformity Conflict Proposition
Justification of the Decomposition
...and 11 more sections

Key Result

Proposition 2.2

Let $\bm{\Phi}_a=\sum_{n\neq a}\bm{\Phi}_a^{(n)}$ be the total uniformity force on anchor $a$, and define $\zeta_a = \cos(\mathbf{V}_a,\bm{\Phi}_a)$. Under the paper's systematic-conflict assumption (label assump:sys-conflict), the alignment–uniformity conflict converges to its maximum as the number of modalities $M$ increases:
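As a numerical illustration of the quantity $\zeta_a$, the sketch below sums component force vectors into $\bm{\Phi}_a$ and takes its cosine with $\mathbf{V}_a$. The vectors are made up, and the function name `conflict_cosine` is hypothetical. How the paper derives $\mathbf{V}_a$ and $\bm{\Phi}_a^{(n)}$ from the InfoNCE objective is not reproduced here.

```python
import numpy as np

def conflict_cosine(v_a, phi_components):
    # Total uniformity force: sum of the component forces Phi_a^(n).
    phi_a = np.sum(phi_components, axis=0)
    # Cosine of the angle between V_a and the total force Phi_a.
    denom = np.linalg.norm(v_a) * np.linalg.norm(phi_a)
    return float(np.dot(v_a, phi_a) / denom)

# Toy 2-D example: two component forces whose sum points opposite to V_a.
v_a = np.array([1.0, 0.0])
phi_components = np.array([[-1.0, 0.5], [-1.0, -0.5]])
zeta_a = conflict_cosine(v_a, phi_components)
print(zeta_a)  # -1.0: the two forces are exactly opposed
```

A value near $-1$ means the two vectors point in opposite directions, while a value near $0$ means they are close to orthogonal.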

Figures (9)

Figure 1: Two conflicts of multimodal InfoNCE. (a) Alignment–uniformity: positives are pulled together yet repelled by the uniformity force; (b) Intra-alignment: non-collinear positives induce angular tension. Both grow with $M$.
Figure 2: ImageBind (paired InfoNCE).
Figure 3: GRAM (volume InfoNCE).
Figure 4: Ours
Figure 6: Modality-interpolation generation results (T+A) $\to$ I. When interpolating between text and audio representations, our method fuses semantic information across modalities more effectively, leading to better generation.
...and 4 more figures

Theorems & Definitions (6)

Proposition 2.2: Alignment–Uniformity Conflict
Proposition 2.3: Intra-alignment Conflict
Proposition 1: Alignment–uniformity Conflict
proof
Proposition 2: Intra-alignment Conflict
proof
