Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion
QingYuan Jiang, Longfei Huang, Yang Yang
TL;DR
This work tackles modality imbalance in multimodal learning by focusing on disparities in classification ability across modalities. It introduces a sustained boosting algorithm that jointly minimizes classification and residual errors, along with an adaptive classifier assignment mechanism that dynamically strengthens the weak modality. A theoretical result shows that the cross-modal loss gap ${\mathcal G}(\Phi)$ converges at rate ${\mathcal O}(1/T)$ under standard smoothness and strong convexity assumptions, providing a convergence guarantee for the proposed boosting scheme. Empirically, the method delivers state-of-the-art performance across six diverse multimodal datasets and is robust to hyperparameter choices and missing-modality scenarios. The authors release code for reproducibility.
Abstract
Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily address this issue by balancing the learning of different modalities, they overlook the inherent disproportion in model classification ability, which is the primary cause of the phenomenon. In this paper, we propose a novel multimodal learning approach that dynamically balances the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm for multimodal learning that simultaneously optimizes the classification and residual errors. We then introduce an adaptive classifier assignment strategy to dynamically improve the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. As a result, the classification ability of the strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Experiments on widely used datasets demonstrate the superiority of our method over various state-of-the-art (SOTA) multimodal learning baselines. The source code is available at https://github.com/njustkmg/NeurIPS25-AUG.
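
To make the boosting principle mentioned in the abstract concrete, the following is a minimal toy sketch of generic residual fitting, not the authors' algorithm. It assumes a single weak modality represented by fixed feature vectors, and each newly added linear classifier is fit (by ridge least squares) to the residual left by the classifiers added so far. The names `boost_weak_modality`, `n_classifiers`, and `ridge` are hypothetical, and the synthetic data is generated only for illustration.

```python
import numpy as np


def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(logits, onehot):
    # Mean cross-entropy between softmax(logits) and one-hot targets.
    p = np.clip(softmax(logits), 1e-12, None)
    return float(-np.mean(np.sum(onehot * np.log(p), axis=1)))


def boost_weak_modality(features, onehot, n_classifiers=5, ridge=1e-3):
    # Hypothetical helper: add linear classifiers one at a time, each
    # fit to the residual (targets minus current predicted probabilities).
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    logits = np.zeros_like(onehot, dtype=float)
    weights = []
    losses = [cross_entropy(logits, onehot)]
    for _ in range(n_classifiers):
        residual = onehot - softmax(logits)
        # Ridge least-squares fit of the new classifier to the residual.
        w = np.linalg.solve(gram, features.T @ residual)
        logits = logits + features @ w
        weights.append(w)
        losses.append(cross_entropy(logits, onehot))
    return weights, losses


# Synthetic stand-in for weak-modality features (an assumption for this sketch).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
labels = np.argmax(features @ rng.normal(size=(5, 3)), axis=1)
onehot = np.eye(3)[labels]

weights, losses = boost_weak_modality(features, onehot)
print([round(v, 4) for v in losses])
```

In this toy setting the training loss does not increase as classifiers are added, because each fitted classifier is a descent direction for the cross-entropy. How the paper couples the classification and residual objectives, and how it decides when to assign an additional classifier to the weak modality, is specified in the paper and the released code rather than here.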
