Modality Equilibrium Matters: Minor-Modality-Aware Adaptive Alternating for Cross-Modal Memory Enhancement

Xiang Shi, Rui Zhang, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu

Abstract

Multimodal fusion is susceptible to modality imbalance, where dominant modalities overshadow weak ones, leading to biased learning and suboptimal fusion, especially under incomplete-modality conditions. To address this problem, we propose a Shapley-guided alternating training framework that adaptively prioritizes minor modalities to balance and thus enhance fusion. Our method uses Shapley value-based scheduling to adapt the training sequence, ensuring that under-optimized modalities receive sufficient learning. Additionally, we introduce a memory module that refines and inherits modality-specific representations, with a cross-modal mapping mechanism that aligns features at both the feature and sample levels. To further validate the adaptability of the proposed approach, the encoder module adopts both conventional and LLM-based backbones. We also introduce a novel multimodal equilibrium metric, the equilibrium deviation metric (EDM), and evaluate both balance and accuracy across four multimodal benchmark datasets, where our method achieves state-of-the-art (SOTA) results. Robustness analysis under missing modalities further highlights its strong generalization capabilities. Our findings reveal the untapped potential of alternating training, demonstrating that strategic modality prioritization balances and promotes multimodal learning and offers a new paradigm for optimizing multimodal training dynamics.
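The Shapley-based scheduling idea can be illustrated with a minimal sketch. It computes exact Shapley values for each modality from a coalition value function and then orders modalities from weakest to strongest contributor. The function names (`shapley_values`, `weak_to_strong_order`), the choice of validation accuracy as the value function, and the accuracy numbers are all hypothetical and used only for illustration; the paper's actual scheduling rule may differ.

```python
from itertools import combinations
from math import factorial


def shapley_values(modalities, value):
    """Exact Shapley value of each modality.

    `value` maps a frozenset of modality names to a scalar score
    (assumed here to be validation accuracy; this is an assumption).
    """
    n = len(modalities)
    phi = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding modality m to coalition s.
                total += weight * (value(s | {m}) - value(s))
        phi[m] = total
    return phi


def weak_to_strong_order(phi):
    """Sort modalities by ascending contribution (weakest first)."""
    return sorted(phi, key=phi.get)


# Hypothetical accuracies for each subset of two modalities.
acc = {
    frozenset(): 0.25,
    frozenset({"audio"}): 0.45,
    frozenset({"text"}): 0.65,
    frozenset({"audio", "text"}): 0.75,
}

phi = shapley_values(["audio", "text"], acc.__getitem__)
order = weak_to_strong_order(phi)
print(phi, order)
```

In this toy setting the contributions sum to the gain of the full model over the empty coalition (the efficiency property of Shapley values), and the weaker modality is scheduled first. Exact enumeration is exponential in the number of modalities, so it is only practical for a handful of them.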

Paper Structure

This paper contains 28 sections, 1 theorem, 25 equations, 9 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Weak-to-Strong ordering improves training under modality imbalance with memory transfer. Consider a fusion objective $\mathcal{L}_{\text{fusion}}(\theta) = \alpha_1 \mathcal{L}_1(\theta, \mathbf{H}^{(1)}) + \alpha_2 \mathcal{L}_2(\theta, \mathbf{H}^{(2)})$, where $\alpha_1 \ll \alpha_2$ captures mod ...

Proof sketch: See Appendix A.
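
As a worked step on the objective as stated, differentiating with respect to the shared parameters $\theta$ gives a weighted sum of per-modality gradients. This follows from linearity alone and is not a restatement of the theorem; it only makes explicit how the weights scale each modality's share of a joint update:

$$\nabla_\theta \mathcal{L}_{\text{fusion}}(\theta) = \alpha_1 \nabla_\theta \mathcal{L}_1(\theta, \mathbf{H}^{(1)}) + \alpha_2 \nabla_\theta \mathcal{L}_2(\theta, \mathbf{H}^{(2)})$$

With $\alpha_1 \ll \alpha_2$ and per-modality gradients of comparable norm, the first term is a small fraction of each joint step, which is the imbalance that an alternating schedule is meant to counteract.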

Figures (9)

  • Figure 1: Architecture of the proposed weak-to-strong alternating training framework. Modalities are sequentially optimized based on contribution deviation, with memory carried across steps to enable progressive fusion and correction.
  • Figure 2: Order analysis. Comparison of initial update sequences and order scheduling strategies.
  • Figure 3: EDM score analysis. Modality contribution deviation shifts under varying encoders.
  • Figure 4: Feature distribution analysis. Evolution of feature distributions over training epochs.
  • Figure 5: Effect of threshold $\tau$ on modality-specific and fused accuracy on IEMOCAP.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Theorem 1