Sparsely Multimodal Data Fusion

Josiah Bjorgaard

TL;DR

This work addresses learning robust multimodal embeddings when data are sparsely multimodal by comparing three fusion approaches: Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO). MCA constructs fusion embeddings for all modality combinations and uses a modality-specific attention mask plus contrastive loss to maintain uniformity while preserving alignment, enabling effective retrieval and strong linear probing on downstream tasks. Across CMU-MOSEI and TCGA datasets, MCA outperforms Zorro on retrieval and downstream regression/classification, while EAO dominates ranking due to its post-inference fusion strategy but underperforms on tasks requiring multimodal interactions. The results underscore the value of contrasting all modality subsets to produce robust fusion embeddings, with practical implications for real-world applications featuring incomplete data.

Abstract

Multimodal data fusion is essential for applications requiring the integration of diverse data sources, especially in the presence of incomplete or sparsely available modalities. This paper presents a comparative study of three multimodal embedding techniques, Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO), to evaluate their performance on sparsely multimodal data. MCA introduces fusion embeddings for all combinations of input modalities and uses attention masking to create distinct attention channels, enabling flexible and efficient data fusion. Experiments on two datasets with four modalities each, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro across ranking, recall, regression, and classification tasks and outperforms EAO across regression and classification tasks. MCA achieves superior performance by maintaining robust uniformity across unimodal and fusion embeddings. While EAO performs best in ranking metrics due to its approach of forming fusion embeddings post-inference, it underperforms in downstream tasks requiring multimodal interactions. These results highlight the importance of contrasting all modality combinations in constructing embedding spaces and offer insights into the design of multimodal architectures for real-world applications with incomplete data.
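
The abstract refers to uniformity and alignment of embeddings without giving their definitions in this summary. As a minimal sketch, assuming the widely used formulation of Wang and Isola (2020) on L2-normalized embeddings (the paper's exact definitions may differ), the two metrics could be computed as follows. The function names and the default exponents `alpha=2` and `t=2` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of x to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def alignment(x, y, alpha=2):
    """Mean distance (raised to alpha) between paired embeddings; lower is better.

    Assumption: row i of x and row i of y embed the same sample,
    e.g. a unimodal embedding and the matching fusion embedding.
    """
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))


def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all distinct pairs; lower is better."""
    # Squared Euclidean distance between every pair of rows.
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    # Keep each unordered pair once and skip the zero diagonal.
    upper = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))
```

Under this formulation both quantities are minimized, which agrees with the downward arrows in the Figure 4 caption.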

Paper Structure (15 sections, 3 equations, 7 figures)

Introduction
Comparison of Related Work
Model
Methods
Datasets
CMU-MOSEI
TCGA
Modal Sparsity
Training
Results
Uniformity and Alignment
Ranking and Recall
Regression and Classification
Conclusion
Appendix

Figures (7)

Figure 1: An overview of the main motivation and purpose of this study, where multimodal datasets (in this case, 4 modalities) that have samples with missing modalities can be encoded into a fused embedding space. The embeddings are used to perform both ranking and retrieval tasks, as well as for downstream regression and classification tasks.
Figure 2: A comparison of the multimodal data fusion designs of EAO (Shvetsova et al., 2022), Zorro (Recasens et al., 2023), and MCA (this work), in which data are fused across various combinations of modalities. The diagram shows fusions for two and three modalities, illustrating the similarities and differences between the studied models as the number of modalities increases.
Figure 3: (a) MCA model architecture demonstrating a single forward pass for modal fusion with 4 modalities. The upper figure demonstrates when all modalities are present and the lower figure shows an example of loss masking when 2 modalities are absent. $N_i$ represents the number of tokens of the related type. (b) An example of a modal channel attention mask for a 4 modality dataset with all possible modality combinations. Inside boxes correspond to attention in other models. In EAO, no learnable fusion tokens are used and each unimodal and 2 modality fusion is performed in a separate forward pass. The attention mask used in Zorro is exactly as shown by including the learnable fusion tokens with attention from all modalities.
Figure 4: Uniformity and alignment metrics as a function of dataset sparsity for CMU-MOSEI and TCGA dataset embeddings, calculated from test dataset splits: (a) uniformity of fusion embeddings ($\downarrow$); (b) mean uniformity of unimodal embeddings ($\downarrow$); (c) mean alignment between unimodal and fusion token embedding spaces ($\downarrow$).
Figure 5: Rank and recall metrics for embeddings from models trained with various levels of modal sparsity on the CMU-MOSEI and TCGA datasets: (a) median rank for CMU-MOSEI ($\downarrow$); (b) median rank for TCGA ($\downarrow$); (c) recall for CMU-MOSEI ($\uparrow$); (d) recall for TCGA ($\uparrow$).
...and 2 more figures
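
The Figure 3 caption describes a modal channel attention mask covering all modality combinations. The sketch below is one hedged reading of that idea, not the author's implementation: it assumes a token layout of modality tokens followed by one block of learnable fusion tokens per combination of two or more modalities, that modality tokens attend only within their own modality, and that each fusion block attends to itself and to its member modalities. The function name `modal_channel_mask` and its parameters are hypothetical.

```python
from itertools import combinations

import numpy as np


def modal_channel_mask(n_tokens, n_fusion_tokens=1):
    """Build a boolean attention mask (True = query row may attend to key column).

    n_tokens: number of tokens per modality (assumed layout: all modality
    tokens first, then one block of fusion tokens per modality combination).
    Returns the mask and a dict mapping each combination to its token slice.
    """
    n_mod = len(n_tokens)
    # Every combination of two or more modalities gets its own fusion channel.
    combos = [c for r in range(2, n_mod + 1)
              for c in combinations(range(n_mod), r)]

    # Sequence positions occupied by each modality's tokens.
    starts = np.concatenate([[0], np.cumsum(n_tokens)])
    mod_slices = [slice(int(starts[i]), int(starts[i + 1]))
                  for i in range(n_mod)]

    total = int(starts[-1]) + n_fusion_tokens * len(combos)
    mask = np.zeros((total, total), dtype=bool)

    # Assumption: modality tokens attend only to tokens of the same modality.
    for s in mod_slices:
        mask[s, s] = True

    # Assumption: fusion tokens attend to themselves and their member modalities.
    offset = int(starts[-1])
    fusion_slices = {}
    for combo in combos:
        f = slice(offset, offset + n_fusion_tokens)
        mask[f, f] = True
        for m in combo:
            mask[f, mod_slices[m]] = True
        fusion_slices[combo] = f
        offset += n_fusion_tokens
    return mask, fusion_slices


# Four modalities give 6 pairs + 4 triples + 1 quadruple = 11 fusion channels.
mask, fusion_slices = modal_channel_mask([2, 3, 1, 2])
```

Because every channel lives in one mask, all fusion embeddings can in principle be produced in a single forward pass, which is the contrast the caption draws with EAO's separate passes.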
