Towards Uniformity and Alignment for Multimodal Representation Learning
Wenzhe Yin, Pan Zhou, Zehao Xiao, Jie Liu, Shujian Yu, Jan-Jakob Sonke, Efstratios Gavves
TL;DR
This work analyzes conflicts intrinsic to multimodal InfoNCE objectives that worsen as the number of modalities grows, identifying an alignment-uniformity conflict and an intra-alignment conflict that create cross-modal distribution gaps. It proposes UniAlign, a principled decoupling of intra-modality uniformity from anchor-based cross-modal alignment, with optional tuple-level extensions, and grounds the approach in a global Hölder divergence, which the method approximates through KDE-based surrogates. UniAlign yields consistent improvements in cross-modal retrieval and UnCLIP-style generation across datasets and decoders, reducing modality gaps without task-specific modules. Overall, it provides a scalable, theoretically justified framework that supports both discriminative and generative multimodal learning beyond pairwise InfoNCE.
Abstract
Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
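To make the idea of decoupling concrete, the following is a minimal sketch and not the paper's actual objective. It assumes the commonly used alignment loss (mean squared distance between paired embeddings) and Gaussian-potential uniformity loss, applies uniformity within each modality only, and aligns every non-anchor modality to a single anchor modality. The function names, the weight `lam`, the temperature `t`, and the choice of anchor are all hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment_loss(anchor, other):
    """Mean squared distance between paired embeddings of two modalities."""
    return sum(sq_dist(a, b) for a, b in zip(anchor, other)) / len(anchor)


def uniformity_loss(embs, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs in ONE modality.

    Lower values mean the embeddings are spread out more evenly.
    """
    n = len(embs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_potential = sum(
        math.exp(-t * sq_dist(embs[i], embs[j])) for i, j in pairs
    ) / len(pairs)
    return math.log(mean_potential)


def decoupled_loss(modalities, anchor_index=0, lam=1.0, t=2.0):
    """Anchor-based alignment plus intra-modality uniformity (assumed form).

    modalities: list of embedding lists, one per modality, with sample k of
    every modality describing the same underlying item.
    """
    anchor = modalities[anchor_index]
    # Alignment: pull each non-anchor modality toward the anchor only,
    # so there is a single alignment direction per modality.
    align = sum(
        alignment_loss(anchor, m)
        for k, m in enumerate(modalities)
        if k != anchor_index
    )
    # Uniformity: repel samples only within the same modality, so the
    # repulsion never acts on cross-modal positive pairs.
    uniform = sum(uniformity_loss(m, t) for m in modalities)
    return align + lam * uniform


# Toy usage: two modalities with two unit-norm samples each.
text = [[1.0, 0.0], [-1.0, 0.0]]
image = [[1.0, 0.0], [-1.0, 0.0]]
loss = decoupled_loss([text, image])
```

In this sketch the repulsive term never touches cross-modal pairs and each modality has exactly one alignment target, which is one simple way to avoid the two conflicts described above. The actual UniAlign formulation may differ.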
