Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition

Chengxiang Huang, Yake Wei, Zequn Yang, Di Hu

TL;DR

This work identifies a prime learning window in multimodal training during which information acquisition is imbalanced across modalities, and shows that information-rich modalities can suppress the others. It introduces Information Acquisition Regulation (InfoReg), which adaptively slows information acquisition for dominant modalities during this window using a Fisher Information-based metric and a per-batch regulation term. The method combines unimodal and multimodal losses with an adaptive coefficient alpha that depends on the observed performance gap, improving information uptake for information-scarce modalities and yielding higher overall accuracy. On CREMA-D, Kinetics Sounds, and CMU-MOSI, InfoReg outperforms existing methods for imbalanced multimodal learning and remains robust across fusion strategies and architectural settings; applying the regulation within the prime window is shown to be essential for the gains. The results suggest that training schedules for multimodal systems should treat the early stage of training differently, and the code is released for reproducibility.
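As a rough illustration of a Fisher Information-based metric, the sketch below computes the trace of the empirical Fisher information for a small linear softmax classifier, using the standard identity that the trace equals the mean squared norm of the per-sample log-likelihood gradient. The function names `softmax` and `fisher_trace`, the linear model, and the toy inputs are assumptions made for illustration; this is not the authors' implementation, which may estimate the quantity differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_trace(weights, inputs, labels):
    """Trace of the empirical Fisher information for a linear softmax model.

    weights: list of per-class weight rows (C x D), hypothetical toy model.
    inputs:  list of feature vectors (N x D).
    labels:  list of integer class labels (N).

    For logits = W x, the gradient of log p(y|x) with respect to W is
    (onehot(y) - p) x^T, so its squared norm factorizes as
    ||onehot(y) - p||^2 * ||x||^2.
    """
    total = 0.0
    for x, y in zip(inputs, labels):
        logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
        probs = softmax(logits)
        err_sq = sum(((1.0 if c == y else 0.0) - p) ** 2
                     for c, p in enumerate(probs))
        x_sq = sum(v * v for v in x)
        total += err_sq * x_sq
    return total / len(inputs)


# Toy usage: an untrained (all-zero) two-class model on a single sample.
trace = fisher_trace([[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0]], [0])
```

A larger trace indicates that the parameters are still changing the likelihood strongly, which is one way to read "rapid information acquisition" early in training.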

Abstract

Sensory training in early life is vital for human development. Inspired by this cognitive phenomenon, we observe that the early training stage, during which dataset information is rapidly acquired, is also important for multimodal learning. We refer to this stage as the prime learning window. However, based on our observation, the prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (InfoReg), a method designed to balance information acquisition among modalities. Specifically, InfoReg slows down the information acquisition of information-sufficient modalities during the prime learning window, which can promote information acquisition by information-insufficient modalities. This regulation enables a more balanced learning process and improves the overall performance of the multimodal network. Experiments show that InfoReg outperforms related imbalanced multimodal learning methods across various datasets. The code is available at https://github.com/GeWu-Lab/InfoReg_CVPR2025.
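The idea of regulating only the dominant modality, and only inside the prime learning window, can be sketched as a gated coefficient on a regulation term. Everything specific here is a hypothetical choice for illustration: the `tanh` mapping from performance gap to coefficient, the `scale` parameter, and the names `regulation_coefficient` and `total_loss` are assumptions, not the formulas used in the paper.

```python
import math


def regulation_coefficient(score_dominant, score_other, in_prime_window,
                           scale=1.0):
    """Hypothetical adaptive coefficient for the dominant modality.

    Returns 0 outside the prime learning window or when the modality is
    not actually ahead; otherwise grows smoothly with the performance gap.
    """
    if not in_prime_window:
        return 0.0
    gap = score_dominant - score_other
    return scale * math.tanh(max(gap, 0.0))


def total_loss(multimodal_loss, regulation_term, alpha):
    """Combine the multimodal loss with a weighted unimodal regulation term."""
    return multimodal_loss + alpha * regulation_term


# Toy usage: the audio branch leads the video branch early in training.
alpha = regulation_coefficient(0.8, 0.5, in_prime_window=True)
loss = total_loss(1.0, 2.0, alpha)
```

With this gating, a modality that is behind, or any modality once the window has closed, is trained with the plain multimodal loss.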

Paper Structure

This paper contains 28 sections, 36 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a). Information amount variation of the audio encoder, video encoder, and multimodal model during the training process on CREMA-D (Cao et al., 2014). (b). Information amount variation of the audio and video modalities when trained independently on CREMA-D.
  • Figure 2: Overview of InfoReg. This figure shows the main components and workflow of InfoReg. The left side presents our overall framework, while the right side highlights the adaptive unimodal regulation. During the training, InfoReg first identifies the information-sufficient modalities, then evaluates whether they are in the prime learning window, and finally applies adaptive unimodal regulation.
  • Figure 3: (a). The gradient gap between the audio modality and video modality on CREMA-D. (b). The $Tr(F_m)$ gap between the audio modality and video modality on CREMA-D.
  • Figure 4: The cosine similarities of gradients across different batches within the prime learning window.
  • Figure 5: (a). The overall accuracy, audio accuracy, and video accuracy of InfoReg are compared with Joint training. (b). The value of $Tr(F)$ in InfoReg for both modalities. (c). The value of $Tr(F)$ of the video modality in InfoReg compared with that of Joint training. All experiments are conducted on CREMA-D.
  • ...and 5 more figures