AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning

Shu Shen, C. L. Philip Chen, Tong Zhang

TL;DR

The paper tackles modality imbalance in multimodal learning, caused by disparate optimization paces across network parameters and depths. It introduces Adaptive Intra-Network Modulation (AIM), a training framework combining Parameter-Adaptive Modulation (PAM), Depth-Adaptive Modulation (DAM), and Depth-Adaptive Prototypes (DAP) to stimulate slow-optimizing components while avoiding suppression of strong modalities. Through a parameter decoupling mechanism and auxiliary blocks, AIM redistributes learning emphasis at each depth based on depth-aware performance estimates, achieving balanced learning and improved unimodal and multimodal results. Extensive experiments across four datasets and multiple backbones demonstrate AIM’s superiority over state-of-the-art baselines, its generalizability, and its potential to accelerate convergence with efficient training.

Abstract

Multimodal learning has significantly enhanced machine learning performance but still faces numerous challenges and limitations. Imbalanced multimodal learning is one of the problems extensively studied in recent works and is typically mitigated by modulating the learning of each modality. However, we find that these methods typically hinder the dominant modality's learning to promote weaker modalities, which affects overall multimodal performance. We analyze the cause of this issue and highlight a commonly overlooked problem: optimization bias within networks. To address this, we propose Adaptive Intra-Network Modulation (AIM) to improve balanced modality learning. AIM accounts for differences in optimization state across parameters and depths within the network during modulation, achieving, for the first time, balanced multimodal learning without hindering either dominant or weak modalities. Specifically, AIM decouples the dominant modality's under-optimized parameters into Auxiliary Blocks and encourages reliance on these performance-degraded blocks for joint training with weaker modalities. This approach effectively prevents suppression of weaker modalities while enabling targeted optimization of under-optimized parameters to improve the dominant modality. Additionally, AIM assesses modality imbalance level across network depths and adaptively adjusts modulation strength at each depth. Experimental results demonstrate that AIM outperforms state-of-the-art imbalanced modality learning methods across multiple benchmarks and exhibits strong generalizability across different backbones, fusion strategies, and optimizers.
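
The depth-adaptive idea in the abstract (estimate the imbalance at each depth, then scale the modulation accordingly) can be illustrated with a small sketch. This is not the paper's formulation: the summary here does not give the actual performance estimator or modulation rule, so the function name `depth_modulation_strengths`, the `tanh` squashing, and the `alpha` parameter are all assumptions chosen only to show the shape of the computation.

```python
import math


def depth_modulation_strengths(dominant_scores, weak_scores, alpha=1.0):
    """Map per-depth performance gaps to modulation strengths in [0, 1).

    Hypothetical helper: `dominant_scores[d]` and `weak_scores[d]` stand in
    for depth-aware performance estimates of the two modalities at depth d.
    The tanh mapping and `alpha` are illustrative assumptions, not AIM's rule.
    """
    if len(dominant_scores) != len(weak_scores):
        raise ValueError("both modalities need one score per depth")

    strengths = []
    for s_dom, s_weak in zip(dominant_scores, weak_scores):
        # Only a positive gap (dominant ahead of weak) triggers modulation.
        gap = max(s_dom - s_weak, 0.0)
        # Squash the gap so a larger imbalance gives a stronger, bounded effect.
        strengths.append(math.tanh(alpha * gap))
    return strengths


if __name__ == "__main__":
    # Three depths: large gap, small gap, no gap.
    print(depth_modulation_strengths([0.9, 0.6, 0.5], [0.3, 0.5, 0.5]))
```

Under these assumptions, a depth where both modalities perform equally receives no modulation, while depths with a larger gap receive progressively stronger (but bounded) modulation.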

Paper Structure

This paper contains 31 sections, 15 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Accuracy of each modality during training on the Kinetics-Sounds dataset for unimodal-only models, the joint-training multimodal baseline, the state-of-the-art balanced multimodal learning method D&R [wei2024diagnosing], and our proposed AIM. An asterisk (*) marks the dominant modality, i.e. the one with higher performance.
  • Figure 2: An overview of the proposed Adaptive Intra-Network Modulation (AIM). (a) The multimodal framework; (b) Overview of AIM, including Parameter-Adaptive Modulation (PAM) and Depth-Adaptive Modulation (DAM); (c) The Depth-Adaptive Prototypes (DAP); (d) Detailed implementation of PAM at depth $d$. The implementation of the parameter decoupling mechanism (green block) is presented in Fig. 3. Without loss of generality, this figure illustrates the case of two modalities, with blue and orange representing different modalities, and Modality 1 set as the dominant modality.
  • Figure 3: The detailed implementation of the parameter decoupling mechanism applied on the network block $\textbf{E}_d^m$.
  • Figure 4: Visualization of inter-class orthogonality of depth-adaptive prototypes at different depths of the audio network on CREMA-D.
  • Figure 5: Variation of the average performance $\bar{s}$ of the visual modality’s original network block and Auxiliary Block during training on CREMA-D. (a) Both the original network block and the Auxiliary Block participate in joint training; (b) Only the Auxiliary Block participates in joint training.
  • ...and 3 more figures