The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagatakis

TL;DR

ConFu addresses the challenge of learning joint representations across three modalities by unifying pairwise and higher-order alignment within a single contrastive objective. It extends CLIP-style learning with a fusion-based higher-order term, providing a lower bound on the total correlation $\mathrm{TC}(X_1,X_2,X_3)$ and enabling effective 1→1 and 2→1 retrieval while preserving pairwise consistency. The authors introduce Bird-MML, a synthetic tri-modal dataset for pretraining and evaluating cross-modal complementarity, and demonstrate ConFu’s robustness and competitive performance across AV-MNIST, affective computing benchmarks, and fine-grained bird classification under noise and distribution shifts. Overall, ConFu offers a principled, scalable approach to higher-order multimodal alignment with minimal architectural overhead, validated by a new dataset and diverse experiments.
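For reference, the total correlation named above is the standard multivariate generalisation of mutual information. The identities below are textbook definitions, not the paper's own derivation or its bound. The last line is the entropy chain rule, and it gives one way to see why a pairwise term and a fused term (two modalities against a third) can together account for the three-way dependence.

```latex
\begin{align}
\mathrm{TC}(X_1,X_2,X_3)
  &= \sum_{i=1}^{3} H(X_i) - H(X_1,X_2,X_3) \\
  &= D_{\mathrm{KL}}\big(p(x_1,x_2,x_3)\,\|\,p(x_1)\,p(x_2)\,p(x_3)\big) \\
  &= I(X_1;X_2) + I\big((X_1,X_2);X_3\big)
\end{align}
```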

Abstract

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.
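The XOR-like dependency mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's synthetic task; it assumes three binary variables with $z_2 = z_1 \oplus z_3$, and the helper `mutual_information` is a hypothetical name introduced here. It shows that each single input carries no information about $z_2$, while the pair $(z_1, z_3)$ determines it completely.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information in bits between the two coordinates of
    equally likely (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


# Enumerate the four equally likely (z1, z3) inputs; z2 is their XOR.
# This binary setup is an illustrative assumption, not the paper's data.
samples = [(z1, z1 ^ z3, z3) for z1, z3 in product((0, 1), repeat=2)]

# Pairwise views: each input alone versus the target z2.
mi_z1_z2 = mutual_information([(z1, z2) for z1, z2, _ in samples])
mi_z3_z2 = mutual_information([(z3, z2) for _, z2, z3 in samples])

# Fused view: both inputs together versus the target z2.
mi_joint_z2 = mutual_information([((z1, z3), z2) for z1, z2, z3 in samples])

print(mi_z1_z2, mi_z3_z2, mi_joint_z2)
```

Both pairwise quantities are zero and the joint one is a full bit, which is why an objective built only from pairwise alignment has nothing to learn from in this setting.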

Paper Structure

This paper contains 59 sections, 15 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: ConFu unifies direct (1$\rightarrow$1) and compositional (2$\rightarrow$1) alignment within a single embedding space, using two modalities for improved performance and adapting seamlessly when only one modality is available.
  • Figure 2: Overview of ConFu. The framework aligns all modality pairs through pairwise contrastive objectives while also aligning each modality with the fused representation of the remaining ones. The final loss ($\mathcal{L}$) combines both objectives (${\mathcal{L}_{pair}}$, ${\mathcal{L}_{fused}}$), balanced by a weighting factor $\lambda$.
  • Figure 3: Results on the synthetic XOR task. Our model's accuracy in predicting $z_2$ from $(z_1, z_3)$ is plotted against the mixing parameter $\hat{p}$. Our model captures the synergistic information, with accuracy increasing as $\hat{p}$ increases. Trimodal pairwise CLIP remains at chance ($\sim3\%$), while both GRAM and TRIANGLE fail to exceed $15\%$ accuracy. More details are provided in the appendix (XOR ablation).
  • Figure 4: Few-shot linear probing results on the SSW60 (top) and VB100 (bottom) datasets. Performance is shown as the number of labeled examples increases. Zero-shot performance is indicated with a star. Prediction is done in the 8-frame average embedding setting.
  • Figure 5: Aggregated modality overlap proportions across all classes in the SSW60 dataset. The chart shows the overall fraction of correctly predicted samples belonging to each overlap category: Audiovisual Only (AV), Vision Only (V), Audio Only (A), Audiovisual & Vision (AV / V), Audiovisual & Audio (AV / A), Vision & Audio (V / A), All, and None.
  • ...and 9 more figures
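The objective described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the convex combination $(1-\lambda)\,\mathcal{L}_{pair} + \lambda\,\mathcal{L}_{fused}$, the temperature value, the function names, and the `mean_fusion` stand-in for a learned fusion module are all hypothetical choices made here.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` and row i of `b` are the positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return float(0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T)))


def mean_fusion(a, b):
    """Hypothetical stand-in for a learned fusion head: average two embeddings."""
    return 0.5 * (a + b)


def confu_style_loss(z, fuse, lam=0.5):
    """z: list of three (batch, dim) arrays, one per modality.

    Assumed weighting: (1 - lam) * L_pair + lam * L_fused.
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    remaining = (2, 1, 0)  # modality left out of each pair
    l_pair = np.mean([info_nce(z[i], z[j]) for i, j in pairs])
    l_fused = np.mean(
        [info_nce(fuse(z[i], z[j]), z[k]) for (i, j), k in zip(pairs, remaining)]
    )
    return float((1 - lam) * l_pair + lam * l_fused)


# Usage with random embeddings: 8 samples, 16 dimensions, three modalities.
rng = np.random.default_rng(0)
z = [rng.normal(size=(8, 16)) for _ in range(3)]
loss = confu_style_loss(z, mean_fusion, lam=0.5)
print(loss)
```

Because the fused embedding lives in the same space as the single-modality embeddings, the same similarity search would serve both 1→1 queries (one embedding) and 2→1 queries (a fused embedding), which is the property the Figure 1 caption highlights.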