Balancing Multimodal Training Through Game-Theoretic Regularization

Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul Pu Liang, Matthew Blaschko, Maarten De Vos

TL;DR

This work tackles modality competition in multimodal training by introducing the Multimodal Competition Regularizer (MCR), which decomposes joint information into task-relevant unique information, shared information, and task-irrelevant information using MI terms such as $I(X_1; Y \mid X_2)$, $I(X_2; Y \mid X_1)$, and $I(X_1; X_2 \mid Y)$. MCR comprises three losses: $\mathcal{L}_{MIPD}$ (for unique information, via latent-space permutations and Jensen-Shannon divergence), $\mathcal{L}_{Con}$ (a supervised contrastive loss for shared information), and $\mathcal{L}_{CEB}$ (a conditional entropy bottleneck that penalizes irrelevant cross-modal content). These are implemented within a game-theoretic framework that uses a latent-space permutation strategy to dynamically balance modalities. Across synthetic and real-world datasets (e.g., CREMA-D, AVE, UCF101, MOSI, MOSEI, Something-Something), MCR consistently outperforms baselines, demonstrating robust handling of modality imbalance and improved multimodal learning performance. The findings indicate that balanced, information-theoretic regularization can unlock the benefits of multimodal fusion, with practical implications for scalable, data-efficient multimodal systems; code and models are released at https://github.com/kkontras/MCR.
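
For orientation, the MI terms named above fit together through a standard chain-rule identity for mutual information. The grouping below follows the definitions of $U_1$, $U_2$, and $S$ given in the Figure 1 caption; it is a textbook identity offered as a reading aid, and not necessarily the exact form of the paper's own equations.

$$
I(X_1, X_2; Y) \;=\; \underbrace{I(X_1; Y \mid X_2)}_{U_1} \;+\; \underbrace{I(X_2; Y \mid X_1)}_{U_2} \;+\; \underbrace{I(X_1; X_2) - I(X_1; X_2 \mid Y)}_{S}
$$

The identity follows from $I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y \mid X_1)$ together with $I(X_1; Y) - I(X_1; Y \mid X_2) = I(X_1; X_2) - I(X_1; X_2 \mid Y)$. Under this reading, $I(X_1; X_2 \mid Y)$ is the cross-modal dependence that remains once the label is known, which is the quantity the TL;DR describes as task-irrelevant.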

Abstract

Multimodal learning holds promise for richer information extraction by capturing dependencies across data sources. Yet, current training methods often underperform due to modality competition, a phenomenon where modalities contend for training resources, leaving some underoptimized. This raises a pivotal question: how can we address training imbalances, ensure adequate optimization across all modalities, and achieve consistent performance improvements as we transition from unimodal to multimodal data? This paper proposes the Multimodal Competition Regularizer (MCR), inspired by a mutual information (MI) decomposition designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) a game-theoretic framework that adaptively balances modality contributions by encouraging each to maximize its informative role in the final prediction; 2) refined lower and upper bounds for each MI term, which enhance the extraction of both task-relevant unique and shared information across modalities; and 3) latent-space permutations for conditional MI estimation, which significantly improve computational efficiency. MCR outperforms all previously suggested training strategies and simple baselines, clearly demonstrating that training modalities jointly leads to important performance gains on both synthetic and large real-world datasets. We release our code and models at https://github.com/kkontras/MCR.

Paper Structure

This paper contains 29 sections, 20 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: (Left) Illustration of the conditional mutual information ($\operatorname{CMI}$) terms, $\operatorname{CMI}_1: I(X_1; Y \mid X_2)$ and $\operatorname{CMI}_2: I(X_2; Y \mid X_1)$, representing the unique contributions ($U_1$, $U_2$) of each modality. The shared task-relevant information ($S$) is defined as $I(X_1; X_2) - I(X_1; X_2 \mid Y)$. (Right) Accuracy on a synthetic dataset designed to induce multimodal competition. We vary the ratio of unique information from modality 1 ($U_1$) to shared information ($S$), while keeping the contribution of modality 2 ($U_2$) constant. As the imbalance increases (moving right on the x-axis), the performance of most methods drops. The standard Joint Training (Singleloss) approach shows a steep decline, highlighting its vulnerability to modality competition where one modality dominates and suppresses the other. In contrast, our method, $\operatorname{MCR}$, demonstrates greater robustness by maintaining the highest accuracy and exhibiting the slowest performance degradation. See the synthetic results section for more details.
  • Figure 2: Multimodal Competition Regularizer ($\operatorname{MCR}$): The diagram illustrates the $\operatorname{MCR}$ framework, which mitigates modality competition in multimodal learning. Raw data ($X_1$ and $X_2$) are encoded into latent representations ($Z_1$ and $Z_2$), which are then permuted to create $\tilde{Z}_1$ and $\tilde{Z}_2$ and the paired combinations. These combinations are passed through the Fusion Network to produce predicted outputs ($Y$, $\tilde{Y}_1$, $\tilde{Y}_2$). The comparison between predictions reveals each modality's contribution. For example, if $Y \approx \tilde{Y}_1$, then $X_1$ has little impact and the model relies on $X_2$. The $\operatorname{MCR}$ loss includes three components: $\mathcal{L}_{\operatorname{MIPD}}$ maximizes the Jensen-Shannon divergence (JSD) between the task output and the permuted-modality predictions, $\mathcal{L}_{\operatorname{Con}}$ aligns modality representations, and $\mathcal{L}_{\operatorname{CEB}}$ penalizes task-irrelevant information by reconstructing back to the latent space.
  • Figure 3: This figure illustrates a key aspect of our training process, showing how competition strategies between modalities are applied. The gradient multiplier adjusts the video encoder's response to Audio Importance ($\text{Importance}_2$). When $k=1$, the video encoder enhances $\text{Importance}_2$; at $k=0$, it remains neutral, and at $k=-1$, it competes by reducing $\text{Importance}_2$ to prioritize its own ($\text{Importance}_1$). This reflects the principle that increasing the importance of one modality can reduce the importance of the other.
  • Figure 4: Error comparison on the CREMA-D dataset across unimodal and multimodal models ($\operatorname{MCR}$, Ensemble, Joint Training, AGM, MLB). Each matrix summarizes model performance based on unimodal prediction correctness. $\operatorname{MCR}$ performs best when at least one unimodal branch is correct (brown box), effectively preserving modality-specific signals. However, AGM and MLB outperform $\operatorname{MCR}$ in the "Both Wrong" case (purple box), where both unimodal predictions fail, indicating stronger synergy in those edge cases. Trends across other datasets are shown in the post-hoc error analysis appendix, with MOSI being a notable exception where $\operatorname{MCR}$ also excels in synergy.
  • Figure 5: Accuracy of the multimodal model on the CREMA-D dataset across training epochs, showing the performance of the full multimodal model (blue) and individual modality linear probing for audio (orange) and video (green). The dashed red line represents the accuracy of a unimodal ensemble model, highlighting how the model's over-reliance on the audio modality negatively impacts the utilization of the video modality.
  • ...and 3 more figures
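
The permutation idea described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the linear fusion head `fuse`, the helper `modality_importance`, and the use of a cyclic shift as the within-batch permutation are all assumptions made for illustration. It only shows how comparing the prediction on $(Z_1, Z_2)$ with the prediction on a permuted latent via the Jensen-Shannon divergence yields a per-modality score that is zero when the fusion head ignores that modality.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fuse(z1, z2, w1, w2):
    """Hypothetical linear fusion head: softmax(W1 z1 + W2 z2)."""
    logits = [
        sum(a * b for a, b in zip(r1, z1)) + sum(a * b for a, b in zip(r2, z2))
        for r1, r2 in zip(w1, w2)
    ]
    return softmax(logits)


def modality_importance(z1_batch, z2_batch, w1, w2, which):
    """Mean JSD between the prediction on (z1, z2) and the prediction
    obtained after swapping in another sample's latent for one modality.

    Assumption: a cyclic shift by one stands in for the within-batch
    permutation, so every sample is paired with a different sample.
    """
    n = len(z1_batch)
    total = 0.0
    for i in range(n):
        j = (i + 1) % n  # index of the sample whose latent is swapped in
        y = fuse(z1_batch[i], z2_batch[i], w1, w2)
        if which == 1:
            y_tilde = fuse(z1_batch[j], z2_batch[i], w1, w2)
        else:
            y_tilde = fuse(z1_batch[i], z2_batch[j], w1, w2)
        total += jsd(y, y_tilde)
    return total / n


if __name__ == "__main__":
    z1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    z2 = [[2.0, -1.0], [-1.0, 2.0], [0.5, 0.5]]
    w_ignore = [[0.0, 0.0], [0.0, 0.0]]  # fusion head ignores modality 1
    w_use = [[1.0, 0.0], [0.0, 1.0]]     # fusion head relies on modality 2
    print(modality_importance(z1, z2, w_ignore, w_use, which=1))
    print(modality_importance(z1, z2, w_ignore, w_use, which=2))
```

In this toy setup the score for modality 1 is exactly zero because the fusion head never reads $Z_1$, which matches the caption's observation that $Y \approx \tilde{Y}_1$ signals that $X_1$ has little impact. Because the swap happens in latent space, each extra prediction costs only a fusion-head evaluation and not a second encoder pass, which is one plausible reading of the efficiency claim in the abstract.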

Theorems & Definitions (1)

  • Definition A.1