Sparsely Multimodal Data Fusion
Josiah Bjorgaard
TL;DR
This work addresses learning robust multimodal embeddings when data are sparsely multimodal by comparing three fusion approaches: Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO). MCA constructs fusion embeddings for all modality combinations and uses a modality-specific attention mask plus a contrastive loss to maintain uniformity while preserving alignment, enabling effective retrieval and strong linear probing on downstream tasks. On the CMU-MOSEI and TCGA datasets, MCA outperforms Zorro on retrieval and on downstream regression and classification, while EAO achieves the best ranking metrics because it forms fusion embeddings post-inference, but underperforms on tasks requiring multimodal interactions. The results underscore the value of contrasting all modality subsets to produce robust fusion embeddings, with practical implications for real-world applications featuring incomplete data.
Abstract
Multimodal data fusion is essential for applications requiring the integration of diverse data sources, especially in the presence of incomplete or sparsely available modalities. This paper presents a comparative study of three multimodal embedding techniques, Modal Channel Attention (MCA), Zorro, and Everything at Once (EAO), to evaluate their performance on sparsely multimodal data. MCA introduces fusion embeddings for all combinations of input modalities and uses attention masking to create distinct attention channels, enabling flexible and efficient data fusion. Experiments on two datasets with four modalities each, CMU-MOSEI and TCGA, demonstrate that MCA outperforms Zorro across ranking, recall, regression, and classification tasks and outperforms EAO across regression and classification tasks. MCA achieves superior performance by maintaining robust uniformity across unimodal and fusion embeddings. While EAO performs best in ranking metrics due to its approach of forming fusion embeddings post-inference, it underperforms in downstream tasks requiring multimodal interactions. These results highlight the importance of contrasting all modality combinations in constructing embedding spaces and offer insights into the design of multimodal architectures for real-world applications with incomplete data.
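To make the idea of "fusion embeddings for all combinations of input modalities" with "distinct attention channels" concrete, the following is a minimal sketch and not the paper's implementation. It enumerates every modality combination of size two or more and builds a boolean attention mask in which each fusion token can see only the input tokens of the modalities in its own combination. The modality names ("a" to "d"), the token layout, the single fusion token per combination, and the rule that input tokens attend only within their own modality are all assumptions made for illustration; the abstract does not specify these details.

```python
from itertools import combinations


def modality_subsets(modalities, min_size=2):
    """Return every combination of modalities with at least min_size members."""
    subsets = []
    for size in range(min_size, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets


def build_channel_mask(modalities, tokens_per_modality, subsets):
    """Build mask[i][j], True when query position i may attend to key position j.

    Assumed layout (hypothetical): input tokens grouped by modality,
    followed by one fusion token per modality combination.
    """
    positions = [("input", m) for m in modalities for _ in range(tokens_per_modality)]
    positions += [("fusion", s) for s in subsets]

    n = len(positions)
    mask = [[False] * n for _ in range(n)]
    for i, (q_kind, q_val) in enumerate(positions):
        for j, (k_kind, k_val) in enumerate(positions):
            if q_kind == "input":
                # Assumption: input tokens stay within their own modality.
                mask[i][j] = k_kind == "input" and k_val == q_val
            elif k_kind == "input":
                # A fusion token sees inputs from the modalities in its combination.
                mask[i][j] = k_val in q_val
            else:
                # Fusion tokens do not see other fusion tokens, only themselves.
                mask[i][j] = i == j
    return positions, mask


def usable_subsets(subsets, present):
    """Keep only the combinations whose modalities are all present in a sample."""
    present = set(present)
    return [s for s in subsets if set(s) <= present]


# Hypothetical modality names; the datasets in the paper have four modalities each.
modalities = ["a", "b", "c", "d"]
subsets = modality_subsets(modalities)
positions, mask = build_channel_mask(modalities, tokens_per_modality=2, subsets=subsets)
```

In this sketch, a sample missing modality "d" would still contribute to the combinations returned by `usable_subsets(subsets, ["a", "b", "c"])`, which is one way a sparsely multimodal sample could be used. Note that attention libraries differ on whether a boolean `True` means "attend" or "block", so the mask may need to be inverted before use.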
