Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning

Divyam Madaan, Taro Makino, Sumit Chopra, Kyunghyun Cho

TL;DR

This work introduces I2M2, a principled framework for multi-modal learning that simultaneously models inter- and intra-modality dependencies through a probabilistic, generative view. By factorizing the label likelihood into unimodal and multimodal components and combining them with a product of experts, I2M2 flexibly adapts to varying dependency strengths across tasks. Across datasets spanning healthcare and vision-language problems (e.g., AV-MNIST, FastMRI, MIMIC-III, NLVR2, VQA-VS), I2M2 consistently outperforms approaches that optimize only inter- or intra-modality cues, and demonstrates robustness to distribution shifts and missing dependencies. The results highlight the practical impact of jointly leveraging multiple sources of information and provide a scalable blueprint for designing versatile multimodal systems in diverse domains.
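The product-of-experts combination mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each expert (one per modality plus one joint model) outputs class logits, and that the experts are combined by summing their log-probabilities and renormalizing. The function names and the example logits are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def product_of_experts(*expert_logits):
    """Multiply the experts' class distributions and renormalize.

    Multiplying probabilities equals summing log-probabilities, so the
    combination is done in log space for numerical stability.
    """
    total = sum(log_softmax(np.asarray(l, dtype=float)) for l in expert_logits)
    return np.exp(log_softmax(total))


# Hypothetical logits for a batch of 2 examples and 3 classes.
logits_x = np.array([[2.0, 0.5, 0.1], [0.0, 1.0, 0.0]])        # unimodal expert for x
logits_x_prime = np.array([[1.5, 0.2, 0.3], [0.1, 0.1, 2.0]])  # unimodal expert for x'
logits_joint = np.array([[0.3, 0.3, 0.3], [0.0, 2.5, 0.5]])    # multimodal expert

probs = product_of_experts(logits_x, logits_x_prime, logits_joint)
print(probs)
```

Under this formulation, an expert that outputs a uniform distribution leaves the combined prediction unchanged, which is one way to see how such a model could adapt when a particular dependency carries little information about the label.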

Abstract

Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing, in isolation, either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches, which rely solely on either inter- or intra-modality dependencies, may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, where we consider the target as a source of multiple modalities and the interaction between them. To that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach on real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods that focus on only one type of modality dependency.

Paper Structure (45 sections, 7 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Data-generating process for various scenarios with two modalities $\mathbf{x}, \mathbf{x'}$ and output $\mathbf{y}$. In multi-modal learning (a), the label modulates the individual modalities (the intra-modality dependencies) and the interaction between them (the inter-modality dependency) through the selection variable $\mathbf{v}$, which is always set to one. In contrast, conventional approaches assume the graphical model in (b) or (c). In (b), the dependency between each individual modality and the label is not modulated through the selection variable $\mathbf{v}$. The graph in (c), on the other hand, assumes that the dependency between the two modalities is independent of the label.
  • Figure 2: Results on the fastMRI dataset. We compare root-sum-of-squares, magnitude, and phase unimodal models, intra-modality modeling, inter-modality modeling, and I2M2 (bars are in the same order). I2M2 performs comparably to the intra-modality model by ignoring the inter-modality dependency, which contributes less to predicting the label than the intra-modality dependencies do.
  • Figure 3: AUROC for models with identical parameter counts. We compare our I2M2 with an ensemble of three magnitude-only models, an ensemble of three phase-only models, an ensemble of magnitude and phase models, and an ensemble of magnitude, phase, and inter-modality models. Despite having the same number of parameters, I2M2 consistently outperforms the ensembles.
  • Figure 4: AUROC comparison with WideResNet models. We compare magnitude-only, phase-only, and inter-modality models trained with the WideResNet-20-3 architecture against our I2M2 trained with PreactResNet-18 across various knee pathologies in the fastMRI dataset (bars are in the same order). I2M2 achieves higher performance than the wider models across all pathologies.
  • Figure 5: Visualization of samples from the VQA-VS OOD test sets. We visualize random instances without text, image, and multi-modal spurious dependencies. Specifically, in the IID sets, words like "what", "sport", and "is" in questions; the presence of "kite" in the image; and a combination of "how" and "many" in the question with animals in the image are spuriously correlated with the labels "tennis", "kite", and "one", "two", "three", respectively. I2M2, using a product of experts, correctly predicts the target even when the spurious dependencies are absent, while the individual expert models do not.
  • ...and 2 more figures
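
Read as a directed graphical model, the caption of Figure 1(a) suggests a factorization of the label posterior along the following lines. This is a sketch inferred from the caption alone, not an equation quoted from the paper: it assumes edges from $\mathbf{y}$ to each of $\mathbf{x}$ and $\mathbf{x'}$, and from $\mathbf{x}, \mathbf{x'}, \mathbf{y}$ to the selection variable $\mathbf{v}$.

$$
p(\mathbf{y} \mid \mathbf{x}, \mathbf{x'}, \mathbf{v}=1) \;\propto\; p(\mathbf{y})\, p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{x'} \mid \mathbf{y})\, p(\mathbf{v}=1 \mid \mathbf{x}, \mathbf{x'}, \mathbf{y})
$$

Under these assumptions, the two middle factors carry the intra-modality dependencies and the last factor carries the inter-modality dependency. Dropping either group would give a restricted model of the kind the caption attributes to the conventional approaches.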