Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
Divyam Madaan, Taro Makino, Sumit Chopra, Kyunghyun Cho
TL;DR
This work introduces I2M2, a principled framework for multi-modal learning that jointly models inter- and intra-modality dependencies from a probabilistic, generative viewpoint. By factorizing the label likelihood into uni-modal and multi-modal components and combining them with a product of experts, I2M2 adapts to varying dependency strengths across tasks. Across datasets spanning healthcare and vision-language problems (e.g., AV-MNIST, FastMRI, MIMIC-III, NLVR2, VQA-VS), I2M2 consistently outperforms approaches that model only inter- or only intra-modality dependencies, and it is robust to distribution shifts and missing dependencies. The results show the practical value of jointly leveraging both kinds of dependency and offer a blueprint for designing multi-modal systems across domains.
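To make the product-of-experts combination concrete, the following is a minimal sketch and not the authors' implementation. It assumes each expert (one per modality, plus one over all modalities jointly) outputs a vector of class logits, and that the experts are combined by multiplying their normalized class distributions and renormalizing, which is the same as summing log-probabilities. The function names and the example logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def product_of_experts(expert_logits):
    """Multiply the experts' class distributions and renormalize.

    expert_logits: list of per-expert logit lists, all of equal length.
    Returns a list of class probabilities that sums to 1.
    """
    num_classes = len(expert_logits[0])
    summed = [0.0] * num_classes
    for logits in expert_logits:
        # Adding log-probabilities is multiplying probabilities.
        for k, log_p in enumerate(log_softmax(logits)):
            summed[k] += log_p
    # Renormalize the product so it is a valid distribution.
    return [math.exp(log_p) for log_p in log_softmax(summed)]


# Hypothetical logits for a 3-class problem with two modalities.
image_expert = [2.0, 0.5, 0.1]   # intra-modality expert (assumed)
text_expert = [0.2, 0.3, 0.1]    # intra-modality expert (assumed)
joint_expert = [1.0, 1.5, 0.0]   # inter-modality expert (assumed)

probs = product_of_experts([image_expert, text_expert, joint_expert])
```

Under this sketch, an expert whose logits are nearly constant across classes contributes an almost uniform factor and therefore barely changes the combined prediction, which is one way to read the claim that the combination adapts to how strong each kind of dependency is for a given task.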
Abstract
Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing, in isolation, either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships between a single modality and the label). We argue that these conventional approaches, which rely solely on either inter- or intra-modality dependencies, may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, in which the target is the source of the individual modalities and of the interaction between them. To that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach on real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods that focus on only one type of modality dependency.
