On-the-fly Modulation for Balanced Multimodal Learning

Yake Wei, Di Hu, Henghui Du, Ji-Rong Wen

TL;DR

On-the-fly Prediction Modulation and On-the-fly Gradient Modulation strategies are proposed to modulate the optimization of each modality, by monitoring the discriminative discrepancy between modalities during training.

Abstract

Multimodal learning is expected to boost model performance by integrating information from different modalities. However, its potential is not fully exploited because the widely used joint training strategy, which has a uniform objective for all modalities, leads to imbalanced and under-optimized uni-modal representations. Specifically, we point out that there often exists a modality with more discriminative information, e.g., vision of playing football and sound of blowing wind. Such a modality can dominate the joint training process, leaving the other modalities significantly under-optimized. To alleviate this problem, we first analyze the under-optimization phenomenon from both the feed-forward and the back-propagation stages during optimization. Then, On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies are proposed to modulate the optimization of each modality by monitoring the discriminative discrepancy between modalities during training. Concretely, OPM weakens the influence of the dominant modality by dropping its feature with a dynamic probability in the feed-forward stage, while OGM mitigates its gradient in the back-propagation stage. In experiments, our methods demonstrate considerable improvement across a variety of multimodal tasks. These simple yet effective strategies enhance performance not only in vanilla and task-oriented multimodal models but also in more complex multimodal tasks, showcasing their effectiveness and flexibility. The source code is available at https://github.com/GeWu-Lab/BML_TPAMI2024.

Paper Structure

This paper contains 39 sections, 20 equations, 8 figures, 16 tables, 2 algorithms.

Figures (8)

  • Figure 1: Performance of individually trained uni-modal models, a jointly trained multimodal model, and jointly trained multimodal models with our proposed OPM and OGM strategies, respectively, on the VGGSound dataset. (a) Performance of the visual modality. (b) Performance of the audio modality. (c) Performance of the audio-visual modalities. Best viewed in color. The training of our OPM and OGM methods exactly aligns with the applied audio-visual model. To provide a more representative observation, the jointly trained multimodal model here uses concatenation fusion, which is widely used and simple but strong. In Appendix D, we also extend these experiments to the more complex CentralNet (Vielzeuf et al., 2018) multimodal framework.
  • Figure 2: The pipeline of the On-the-fly Prediction Modulation strategy. Here we take two modalities as an example. In the feed-forward stage, the feature of modality $m$ is randomly dropped with probability $q^m$, where the probability is determined by the discriminative discrepancy ratio at the last iteration. Via OPM, the remaining feature of the suppressed modality can affect the multimodal prediction more, thereby improving its learning.
  • Figure 3: The pipeline of the On-the-fly Gradient Modulation strategy. Here we take two modalities as an example. In the back-propagation stage, the gradient of modality $m$ is modulated with $k^m$, which is determined by the discriminative discrepancy ratio at this iteration. Via OGM, the gradient of the modality with more discriminative information is weakened, while the remaining modality is not affected and can gain more training.
  • Figure 4: Discrepancy ratio on the Kinetics-Sounds dataset. (a) Discrepancy ratio of Concatenation, CentralNet (Vielzeuf et al., 2018), and our methods during training. (b) Discrepancy ratio of the Concatenation, OGM, and OGM* methods based on Concatenation fusion during training. OGM* indicates the strategy that only increases the gradient of the worse-learnt modality. More discussion of OGM* is provided in the corresponding section of the paper.
  • Figure 5: (a&b): Missing modality cases on Kinetics-Sounds dataset. (c&d): Missing modality cases on UCF-101 dataset. OPM method is based on Concatenation fusion.
  • ...and 3 more figures
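
The captions of Figures 2 and 3 describe two coefficients driven by the discriminative discrepancy ratio: a drop probability $q^m$ for OPM (computed from the ratio at the last iteration) and a gradient scale $k^m$ for OGM (computed from the ratio at the current iteration). The following is a minimal sketch of that idea, not the authors' implementation. The function names, the `alpha` hyperparameter, the tanh-based mapping from ratio to coefficient, and the choice to implement "dropping" as zeroing the feature are all assumptions made for illustration; this page does not give the exact formulas.

```python
import math
import random


def discrepancy_ratio(score_m, score_other, eps=1e-8):
    """Ratio of modality m's discriminative score to the other modality's.

    How the scores are measured is not specified on this page; any
    positive per-modality score is assumed here.
    """
    return score_m / (score_other + eps)


def modulation_strength(ratio, alpha=1.0):
    """Hypothetical mapping from the ratio to a value in [0, 1].

    Returns 0 when modality m is not dominant (ratio <= 1) and grows
    toward 1 as m becomes more dominant. The tanh form is an assumption.
    """
    return math.tanh(alpha * (ratio - 1.0)) if ratio > 1.0 else 0.0


def opm_drop(feature, ratio_prev, alpha=1.0, rng=random):
    """OPM-style step: drop modality m's feature with probability q^m.

    ratio_prev is the discrepancy ratio from the previous iteration.
    Dropping is implemented as zeroing the feature (an assumption).
    """
    q = modulation_strength(ratio_prev, alpha)
    if rng.random() < q:
        return [0.0] * len(feature)
    return list(feature)


def ogm_scale(grad, ratio_now, alpha=1.0):
    """OGM-style step: scale modality m's gradient by k^m in [0, 1].

    ratio_now is the discrepancy ratio at the current iteration. A
    non-dominant modality keeps its gradient unchanged (k^m = 1).
    """
    k = 1.0 - modulation_strength(ratio_now, alpha)
    return [k * g for g in grad]
```

In this sketch only the dominant modality (ratio above 1) is ever modulated, which mirrors the captions' statement that the other modality is left unaffected. In a real training loop the same coefficients would be applied to encoder features and encoder gradients rather than to plain Python lists.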