Balancing Multimodal Training Through Game-Theoretic Regularization
Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul Pu Liang, Matthew Blaschko, Maarten De Vos
TL;DR
This work tackles modality competition in multimodal training by introducing the Multimodal Competition Regularizer (MCR). MCR decomposes joint information into task-relevant unique information, shared information, and task-irrelevant information, using mutual information (MI) terms such as $I(X_1; Y \mid X_2)$, $I(X_2; Y \mid X_1)$, and $I(X_1; X_2 \mid Y)$. It comprises three losses: $\mathcal{L}_{MIPD}$ (unique information, via latent-space perturbations and Jensen-Shannon divergence), $\mathcal{L}_{Con}$ (a supervised contrastive loss for shared information), and $\mathcal{L}_{CEB}$ (a conditional entropy bottleneck that penalizes irrelevant cross-modal content). These losses are implemented within a game-theoretic framework that uses a latent-space permutation strategy to balance modalities dynamically. Across synthetic and real-world datasets (e.g., CREMA-D, AVE, UCF101, MOSI, MOSEI, Something-Something), MCR consistently outperforms baselines, indicating that it handles modality imbalance robustly and improves multimodal learning performance. The findings suggest that balanced, information-theoretic regularization can unlock the benefits of multimodal fusion, with practical implications for scalable, data-efficient multimodal systems; code and models are released at https://github.com/kkontras/MCR.
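As background, the conditional MI terms named in the summary are tied together by a standard identity that follows from the chain rule of mutual information. It is given here as a general information-theoretic fact, not as the paper's exact formulation:

$$I(X_1, X_2; Y) = I(X_1; Y \mid X_2) + I(X_2; Y \mid X_1) + I(X_1; X_2) - I(X_1; X_2 \mid Y)$$

One reading consistent with the summary: the first two terms are the task-relevant information unique to each modality, $I(X_1; X_2) - I(X_1; X_2 \mid Y)$ is the task-relevant information the modalities share, and $I(X_1; X_2 \mid Y)$ is cross-modal information that remains once the label is known, i.e. content that is not task-relevant.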
Abstract
Multimodal learning holds promise for richer information extraction by capturing dependencies across data sources. Yet current training methods often underperform due to modality competition, a phenomenon where modalities contend for training resources, leaving some underoptimized. This raises a pivotal question: how can we address training imbalances, ensure adequate optimization across all modalities, and achieve consistent performance improvements as we transition from unimodal to multimodal data? This paper proposes the Multimodal Competition Regularizer (MCR), inspired by a mutual information (MI) decomposition designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) a game-theoretic framework that adaptively balances modality contributions by encouraging each modality to maximize its informative role in the final prediction; 2) refined lower and upper bounds for each MI term, which enhance the extraction of both task-relevant unique and shared information across modalities; and 3) latent-space permutations for conditional MI estimation, which significantly improve computational efficiency. MCR outperforms all previously suggested training strategies and simple baselines, clearly demonstrating that training modalities jointly leads to important performance gains on both synthetic and large real-world datasets. We release our code and models at https://github.com/kkontras/MCR.
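To make the idea of latent-space permutations concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation and not the $\mathcal{L}_{MIPD}$ loss itself: it shuffles one modality's latent codes across the batch and measures, with the Jensen-Shannon divergence, how much the prediction changes. The names `permutation_sensitivity`, `uses_both`, and `ignores_z2`, the linear fusion heads, and the random latents are all hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Per-row Jensen-Shannon divergence (in nats) between p and q."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def permutation_sensitivity(z1, z2, head, rng):
    """Mean JS divergence between predictions made with matched z2
    and with z2 shuffled across the batch (hypothetical proxy)."""
    perm = rng.permutation(z2.shape[0])
    p_matched = softmax(head(z1, z2))
    p_shuffled = softmax(head(z1, z2[perm]))
    return float(js_divergence(p_matched, p_shuffled).mean())

# Toy latents and linear fusion heads (all assumed for illustration).
rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 4))   # latent codes of modality 1
z2 = rng.normal(size=(16, 4))   # latent codes of modality 2
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(4, 3))

uses_both = lambda a, b: a @ w1 + b @ w2   # prediction depends on both modalities
ignores_z2 = lambda a, b: a @ w1           # modality 2 has been "out-competed"

score_both = permutation_sensitivity(z1, z2, uses_both, rng)
score_ignored = permutation_sensitivity(z1, z2, ignores_z2, rng)
print(score_both, score_ignored)
```

A head that ignores the second modality scores zero, because shuffling that modality's latents leaves the prediction unchanged; a head that relies on it scores strictly above zero and at most $\ln 2$. Permuting within the batch needs no extra encoder passes, which is one plausible reason such a strategy is computationally cheap.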
