See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias

JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, Juhwan Choi, YoungBin Kim

TL;DR

Dominant modality bias in vision–language models arises when one modality drives predictions, hindering balanced multimodal integration. The paper introduces BalGrad, a gradient-based framework that (i) reweights KL-divergence gradients between modalities according to each modality’s learning status and (ii) projects the target-task gradient to avoid conflicts with the KL gradient, ensuring balanced convergence. Theoretical analysis links gradient magnitude and direction to loss reduction, and extensive experiments on UPMC Food-101, Hateful Memes, MM-IMDb, and additional datasets demonstrate reduced modality gaps, improved robustness to impairment, and applicability to decoder-based VL models. This approach offers a practical pathway to suppress negative transfer while preserving cross-modal integration in real-world multimodal systems.

Abstract

Vision-language (VL) models have demonstrated strong performance across various tasks. However, these models often rely on a specific modality for predictions, leading to "dominant modality bias." This bias significantly hurts performance, especially when one modality is impaired. In this study, we analyze model behavior under dominant modality bias and theoretically show that unaligned gradients or differences in gradient magnitudes prevent balanced convergence of the loss. Based on these findings, we propose a novel framework, BalGrad, to mitigate dominant modality bias. Our approach combines inter-modality gradient reweighting, which adjusts the gradient of the KL divergence based on each modality's contribution, with inter-task gradient projection, which aligns task directions in a non-conflicting manner. Experiments on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on specific modalities when making predictions.
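
The two gradient operations named in the abstract can be illustrated with a minimal sketch on flattened gradient vectors. This is not the authors' implementation. The function names (`combine_kl_gradients`, `project_if_conflicting`) are hypothetical, and the weights `w_v` and `w_l` are passed in as plain parameters because the rule that derives them from each modality's contribution is not stated in this summary. The projection step assumes that a "conflict" means a negative inner product between the task gradient and the combined KL gradient.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def combine_kl_gradients(
    g_v_kl: Sequence[float],
    g_l_kl: Sequence[float],
    w_v: float,
    w_l: float,
) -> List[float]:
    """Weighted sum of the image-side and text-side KL gradients.

    ASSUMPTION: w_v and w_l are placeholders supplied by the caller.
    How they are computed from each modality's learning status is
    not specified here.
    """
    return [w_v * gv + w_l * gl for gv, gl in zip(g_v_kl, g_l_kl)]


def project_if_conflicting(
    g_task: Sequence[float], g_kl: Sequence[float]
) -> List[float]:
    """Remove the component of g_task along g_kl when the two conflict.

    ASSUMPTION: "conflict" is taken to mean a negative inner product.
    Aligned or orthogonal gradients are returned unchanged.
    """
    inner = dot(g_task, g_kl)
    norm_sq = dot(g_kl, g_kl)
    if inner >= 0.0 or norm_sq == 0.0:
        return list(g_task)
    scale = inner / norm_sq
    return [gt - scale * gk for gt, gk in zip(g_task, g_kl)]


# Toy two-dimensional example with made-up numbers.
g_kl = combine_kl_gradients([0.0, 2.0], [0.0, 0.0], w_v=0.5, w_l=0.5)
g_task = [1.0, -1.0]  # points against g_kl along the second axis
g_perp = project_if_conflicting(g_task, g_kl)
print(g_kl, g_perp, dot(g_perp, g_kl))
```

After projection, the task gradient has no component along the combined KL gradient, so a step along it does not, to first order, undo the KL update. In a real model the same operations would be applied to the parameter gradients of the embedding layers rather than to toy lists.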

Paper Structure

This paper contains 24 sections, 13 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Conceptual visualization of dominant modality bias. The key modality differs by task: (a) For the hate recognition task, text descriptions of memes lead, while (b) for the food classification task, food images play a crucial role in prediction.
  • Figure 2: Experimental results on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets in the presence of dominant modality bias. (a) Performance visualization under different missing conditions (full, image only (missing text), text only (missing image)) for each dataset. (b) Illustration of learning curves for each modality across datasets.
  • Figure 3: (a) The overall training framework of our proposed BalGrad. The final classifier $f_{\mathcal{T}}(\cdot)$ is updated with the gradient $g^{\perp}_{\mathcal{T}}$ for the cross-entropy (CE) loss. The image and text embedding layers $h_v(\cdot), h_l(\cdot)$ are also updated with $g^{\perp}_{\mathcal{T}}$, along with the gradients of the CE loss for each modality, $g^v_{\mathcal{T}}, g^l_{\mathcal{T}}$, and the gradients of the KL divergence between the two modalities' predictions, $g^v_{kl}, g^l_{kl}$. (b) Inter-modality gradient reweighting adjusts the magnitudes of $g^v_{kl}$ and $g^l_{kl}$ to obtain $g_{kl}$. If a conflict occurs, inter-task gradient projection projects $g^{\perp}_{\mathcal{T}}$ onto the direction orthogonal to $g_{kl}$.
  • Figure 4: Robustness of BalGrad and existing methods to different missing ratios $r$ on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets.
  • Figure 5: Bar plots comparing the performance of existing methods and BalGrad using BLIP. Each bar represents $\Delta_{\textit{Gap}}$(%), defined as the performance difference between missing image and missing text conditions.
  • ...and 4 more figures
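
The $\Delta_{\textit{Gap}}$ measure in the Figure 5 caption can be computed as in the short sketch below. The function name `delta_gap` is hypothetical, the input scores are made-up illustrative values rather than results from the paper, and taking the absolute value is an assumption: the caption only says "performance difference" and does not fix a sign convention.

```python
def delta_gap(score_missing_image: float, score_missing_text: float) -> float:
    """Gap (in percentage points) between the two missing-modality scores.

    ASSUMPTION: the absolute difference is used, so the result does not
    depend on which modality is the dominant one.
    """
    return abs(score_missing_image - score_missing_text)


# Illustrative numbers only, not taken from the paper.
print(delta_gap(60.0, 85.0))
```

Under this reading, a smaller value means the model degrades about equally when either modality is removed, which is the balanced behavior the method targets.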

Theorems & Definitions (4)

  • proof
  • proof
  • proof
  • proof