DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning

Chengxuan Qian, Kai Han, Jiaxin Liu, Zhenlong Yuan, Zhengzhong Zhu, Jingchao Wang, Chongwen Lyu, Jun Chen, Zhe Liu

TL;DR

DynCIM tackles modality and sample imbalances in multimodal learning by introducing a dual curriculum: a sample-level difficulty assessment based on prediction deviation, consistency, and stability, and a modality-level curriculum using global (Geometric Mean Ratio) and local (Harmonic Mean Improvement Rate) measures. A gating-based dynamic fusion mechanism then adapts modality contributions in real time, guided by an adaptive balance between overall fusion effectiveness and individual modality optimization. The framework optimizes a joint curriculum objective that reweights informative samples and fusion signals, and extensive experiments on six benchmarks show consistent improvements over state-of-the-art methods with competitive computational efficiency. This approach enhances inter-modal cooperation, robustness to noise, and convergence speed, making multimodal models more scalable and reliable in heterogeneous data regimes.
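As a small illustration of why geometric and harmonic means are natural building blocks for imbalance measures, the sketch below compares them with the arithmetic mean on per-modality scores. It does not reproduce the paper's GMR or HMIR definitions, which are not given in this summary; the function name `summarize` and the score values are hypothetical and chosen only for illustration.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def summarize(scores):
    """Return the arithmetic, geometric, and harmonic means of per-modality scores."""
    return {
        "arithmetic": fmean(scores),
        "geometric": geometric_mean(scores),
        "harmonic": harmonic_mean(scores),
    }


# Hypothetical per-modality accuracies (illustrative numbers, not from the paper).
balanced = summarize([0.6, 0.6, 0.6])
imbalanced = summarize([0.9, 0.8, 0.1])

for name, stats in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Both score sets share the same arithmetic mean, but the geometric and harmonic means drop sharply when one modality lags. That sensitivity to the weakest modality is the general property such measures can exploit.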

Abstract

Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample's difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from both global and local perspectives. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmark datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at https://github.com/Raymond-Qiancx/DynCIM.
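The general idea of gating-based fusion can be sketched as a softmax-weighted sum of per-modality features. This is a generic, minimal illustration and not the DynCIM implementation: the names `softmax` and `gated_fusion` are hypothetical, and the gate logits are passed in by hand here, whereas a real model would presumably produce them from learned parameters and curriculum signals.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_logits):
    """Fuse equal-length per-modality feature vectors with softmax gate weights.

    features: list of feature vectors, one per modality (assumed same length).
    gate_logits: one scalar per modality; higher means a larger contribution.
    """
    if len(features) != len(gate_logits):
        raise ValueError("need exactly one gate logit per modality")
    weights = softmax(gate_logits)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Toy two-modality example (values are illustrative only).
audio = [1.0, 0.0]
visual = [0.0, 1.0]
fused_equal, weights_equal = gated_fusion([audio, visual], [0.0, 0.0])
fused_skewed, weights_skewed = gated_fusion([audio, visual], [2.0, 0.0])
```

With equal logits both modalities contribute equally. Raising one logit shifts the fused representation toward that modality while the weights still sum to one.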

Paper Structure

This paper contains 23 sections, 14 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: (a) Performance gaps across modalities, indicating the dominance of certain modalities and the inherent limitations of naive fusion. (b) Variations in convergence rates on the Kinetics Sounds dataset, uncovering modality-specific learning dynamics. (c) Our approach consistently outperforms baselines on both bimodal and trimodal datasets, demonstrating its adaptability across different multimodal scenarios.
  • Figure 2: Different samples within the same modality exhibit varying levels of granularity, with the left image depicting the overall action category and contextual environment, while the right image highlights fine-grained movement details.
  • Figure 3: Overview of our proposed DynCIM curriculum learning framework. DynCIM leverages unimodal encoders to encode multimodal inputs, yielding outputs and byproducts of backpropagation. The Sample-level Curriculum accounts for sample-level imbalances by evaluating task-specific difficulty based on prediction deviation, consistency, and stability. Meanwhile, the Modality-level Curriculum adopts a global-to-local approach, introducing the Geometric Mean Ratio (GMR) to evaluate overall modality impact and the Harmonic Mean Improvement Ratio (HMIR) to capture fine-grained variations. A gating mechanism further regulates modality contributions, dynamically adjusting weights to mitigate modality imbalances. Finally, the refined multimodal representations are fused to generate the final prediction output, ensuring an adaptive and progressive learning process.
  • Figure 4: Visualization of the modality gap between Audio and Visual on the Kinetics Sounds dataset.
  • Figure 5: t-SNE visualization (van der Maaten and Hinton, 2008) of feature distributions on the KS dataset, comparing Concatenation, the method without SDC or MDC, and the complete method, with categories shown in different colors.
  • ...and 2 more figures