Diagnosing and Re-learning for Balanced Multimodal Learning

Yake Wei, Siwei Li, Ruoxuan Feng, Di Hu

TL;DR

The paper tackles the imbalanced multimodal learning problem in settings where some modalities are intrinsically less informative. It introduces Diagnosing & Re-learning (D&R), which first assesses the learning state of each modality $k$ via the separability of its uni-modal representation, quantified by a purity-gap metric $g^k=|P^k_\text{D}-P^k_\text{V}|$, and then softly re-initializes each encoder with strength $\tilde{\theta}_k=\tanh(\lambda g^k)$ so that well-learnt modalities are de-emphasized and underfitting ones are enhanced; the update is $\boldsymbol{\theta}_k=(1-\tilde{\theta}_k)\boldsymbol{\theta}_k^\text{current}+\tilde{\theta}_k\boldsymbol{\theta}_k^\text{init}$. This approach preserves cross-modal knowledge while preventing over-fitting on noisy modalities, and it is compatible with various backbones, including transformers. Experiments on CREMA-D, Kinetics Sounds, UCF-101, and CMU-MOSI show consistent improvements over state-of-the-art imbalanced multimodal methods in both two-modality and multi-modality settings, including scenarios with scarcely informative modalities. The method's simplicity, flexibility, and demonstrated gains suggest strong practical impact for balanced multimodal learning in diverse applications.
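
The soft re-initialization rule described in the TL;DR can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: encoder parameters are represented as plain lists of floats, the function names `reinit_strength` and `soft_reinit` are hypothetical, and the purity values and $\lambda$ used at the bottom are made up for the example.

```python
import math


def reinit_strength(purity_train, purity_val, lam):
    """Return tanh(lam * g), where g = |P_D - P_V| is the purity gap."""
    gap = abs(purity_train - purity_val)
    return math.tanh(lam * gap)


def soft_reinit(current, init, strength):
    """Interpolate each parameter between its current and initial value.

    strength = 0 keeps the current parameters unchanged;
    strength = 1 resets them fully to their initial values.
    """
    return [(1.0 - strength) * c + strength * i for c, i in zip(current, init)]


# Hypothetical values, for illustration only.
theta_init = [0.0, 0.0, 0.0]
theta_current = [1.0, -2.0, 0.5]
strength = reinit_strength(purity_train=0.9, purity_val=0.6, lam=1.0)
theta_new = soft_reinit(theta_current, theta_init, strength)
```

Because $\tanh$ maps non-negative inputs into $[0, 1)$, a zero purity gap leaves the encoder untouched, while a large gap pulls the parameters most of the way back to their initial values.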

Abstract

To overcome the imbalanced multimodal learning problem, where models prefer the training of specific modalities, existing methods control the training of uni-modal encoders from different perspectives, taking the inter-modal performance discrepancy as the basis. However, they ignore the intrinsic limitation of modality capacity. Scarcely informative modalities can be mistaken for "worse-learnt" ones, which can force the model to memorize more noise and counterproductively harm the multimodal model's ability. Moreover, current modality modulation methods concentrate narrowly on the selected worse-learnt modalities, even suppressing the training of the others. Hence, it is essential to consider the intrinsic limitation of modality capacity and to take all modalities into account during balancing. To this end, we propose the Diagnosing & Re-learning method. The learning state of each modality is first estimated from the separability of its uni-modal representation space, and is then used to softly re-initialize the corresponding uni-modal encoder. In this way, over-emphasis on scarcely informative modalities is avoided. In addition, the encoders of worse-learnt modalities are enhanced while the over-training of other modalities is avoided. Accordingly, multimodal learning is effectively balanced and enhanced. Experiments covering multiple types of modalities and multimodal frameworks demonstrate the superior performance of our simple-yet-effective method for balanced multimodal learning. The source code and dataset are available at https://github.com/GeWu-Lab/Diagnosing_Relearning_ECCV2024.
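
The abstract does not spell out how separability is measured; the TL;DR and Figure 4 refer to a purity gap between training and validation representations. The sketch below shows one standard way to compute clustering purity and the resulting gap. It is an assumption rather than the paper's exact procedure: cluster assignments are taken as given (for instance from k-means run on the uni-modal features), and the function name `purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Toy data, for illustration only: clusters are clean on the training split
# but mixed on the validation split.
train_clusters = [0, 0, 0, 1, 1, 1]
train_labels = ["a", "a", "a", "b", "b", "b"]
val_clusters = [0, 0, 0, 1, 1, 1]
val_labels = ["a", "a", "b", "b", "a", "b"]

purity_gap = abs(purity(train_clusters, train_labels)
                 - purity(val_clusters, val_labels))
```

A large gap means the representation separates the training data far better than held-out data, and it is this gap that drives the re-initialization strength.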

Paper Structure (17 sections, 6 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: (a): Scarcely informative modality case. It shows the accuracy improvement compared with the joint-training baseline. Only our method has a positive performance improvement. (b)&(c): Uni-modal encoder quality evaluation and comparison. The uni-modal evaluation (Acc audio and Acc vision) is obtained by fine-tuning a new uni-modal classifier with the corresponding trained uni-modal encoder. A larger spot size reflects a better multimodal performance. Our method is superior in both multimodal performance and all uni-modal performance.
  • Figure 2: Illustration of multimodal framework and the proposed Diagnosing & Re-learning method.
  • Figure 3: Uni-modal representation visualization by t-SNE (van der Maaten and Hinton, 2008) on the CREMA-D dataset. The categories are indicated in different colors. JT denotes joint training.
  • Figure 4: (a): The purity gap between training and validation representation. (b): Changes in test accuracy during training. (c&d): The purity of test representation. All results are based on the CREMA-D dataset.
  • Figure 5: Hyper-parameter sensitivity analysis of $\lambda$ in the re-initialization strength equation and of the Diagnosing & Re-learning frequency $H$ on the CREMA-D and Kinetics Sounds datasets.