Detached and Interactive Multimodal Learning

Yunfeng Fan; Wenchao Xu; Haozhao Wang; Junhong Liu; Song Guo

TL;DR

Modality competition in joint multimodal learning degrades performance by letting strong modalities dominate learning. The paper introduces DI-MML, a detached training framework that trains each modality encoder with isolated objectives while enabling cross-modal interaction through a shared classifier and a Dimension-decoupled Unidirectional Contrastive (DUC) loss. An instance-level certainty-aware logit weighting strategy is applied during inference to fully exploit complementarities. Across CREMA-D, AVE, UCF101, and ModelNet40, DI-MML achieves competitive unimodal performance and superior multimodal accuracy, outperforming joint-training baselines and reducing modality competition. This approach provides a principled, competition-free pathway to leverage complementary information in multimodal data and can be extended to additional modalities and tasks.

Abstract

Recently, Multimodal Learning (MML) has gained significant interest as it compensates for single-modality limitations through comprehensive complementary information within multimodal data. However, traditional MML methods generally use a joint learning framework with a uniform learning objective, which can lead to the modality competition issue, where feedback predominantly comes from certain modalities, limiting the full potential of others. In response to this challenge, this paper introduces DI-MML, a novel detached MML framework designed to learn complementary information across modalities under the premise of avoiding modality competition. Specifically, DI-MML addresses competition by separately training each modality encoder with isolated learning objectives. It further encourages cross-modal interaction via a shared classifier that defines a common feature space, and it employs a dimension-decoupled unidirectional contrastive (DUC) loss to facilitate modality-level knowledge transfer. Additionally, to account for varying reliability in sample pairs, we devise a certainty-aware logit weighting strategy to effectively leverage complementary information at the instance level during inference. Extensive experiments conducted on audio-visual, flow-image, and front-rear view datasets show the superior performance of our proposed method. The code is released at https://github.com/fanyunfeng-bit/DI-MML.
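To make the instance-level weighting idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract only states that logits are weighted by certainty per sample at inference; the certainty measure used here (the maximum softmax probability) and the names `certainty` and `weighted_fusion` are assumptions chosen for illustration, and the released repository should be consulted for the actual formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def certainty(logits):
    """ASSUMPTION: certainty is the maximum softmax probability.
    The source does not specify the certainty measure."""
    return max(softmax(logits))


def weighted_fusion(logits_per_modality):
    """Fuse the logits of ONE sample across modalities.

    Each modality's logits are scaled by its normalized certainty,
    so the weights differ from sample to sample (instance level).
    """
    scores = [certainty(z) for z in logits_per_modality]
    total = sum(scores)
    weights = [s / total for s in scores]
    num_classes = len(logits_per_modality[0])
    fused = [
        sum(w * z[c] for w, z in zip(weights, logits_per_modality))
        for c in range(num_classes)
    ]
    return fused, weights


# Hypothetical logits for one sample with three classes.
audio_logits = [4.0, 0.5, 0.2]   # confident about class 0
visual_logits = [0.9, 1.0, 0.8]  # nearly uniform, slightly prefers class 1

fused, weights = weighted_fusion([audio_logits, visual_logits])
prediction = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the audio branch is far more certain than the visual branch, so it receives the larger weight and the fused prediction follows it. For a different sample where the visual branch is the confident one, the weights would flip, which is the point of weighting per instance rather than per modality.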

Paper Structure (19 sections, 9 equations, 8 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Modality Competition in MML
Contrastive Learning in MML
METHODOLOGY
Modality Competition Analysis
Detached and Interactive MML
Instance-level Weighting
Comparison with MCRL Loss
EXPERIMENTS
Dataset
Experimental Settings
The Effectiveness of DI-MML
Robustness Validation
Conclusion
...and 4 more sections

Figures (8)

Figure 1: The difference between previous methods and ours. Only our method abandons the uniform fusion objective and updates each modality's network with isolated objectives.
Figure 2: Modality competition comes from the uniform learning objective. The columns represent predicted probabilities for each class. The fused prediction is dominated by modality 1 (the stronger one), resulting in a significant gap between the fusion gradient and the gradient needed for modality 2 (the weaker one).
Figure 3: Overall framework of DI-MML. The encoders of each modality are trained with isolated learning objectives. The connections and interactions between modalities during encoder training are enabled by the shared classifier and the DUC loss.
Figure 4: During inference, logit weighting is applied at the instance level.
Figure 5: Traditional contrastive loss is hard, aligning all dimensions bidirectionally. Our DUC loss is soft, operating on a subset of dimensions and transferring only complementarities. Blue and green denote effective dimensions and white denotes ineffective dimensions. Red represents alignment between corresponding dimensions.
...and 3 more figures
