G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation

Mohammed Rakib, Arunkumar Bagavathi

TL;DR

We address modality imbalance in multimodal learning by introducing Gradient-Guided Distillation (G2D), which distills from multiple unimodal teachers to a multimodal student using a composite loss $\mathcal{L}_{\text{G2D}}$ that combines supervised multimodal loss with per-modality feature and logit distillation. A Sequential Modality Prioritization (SMP) schedule, guided by unimodal confidence scores $\rho_t^m$, dynamically prioritizes weaker modalities to reduce dominance and improve balanced optimization. Across six real-world datasets and both classification and regression tasks, G2D with SMP outperforms state-of-the-art baselines, improves feature-space alignment between unimodal and multimodal encoders, and yields more robust performance with minimal inference overhead since teachers are precomputed. The approach is compatible with various fusion strategies and KD formulations, offering a practical path toward more balanced and reliable multimodal systems. Overall, G2D demonstrates that coordinated distillation and adaptive gradient scheduling can meaningfully mitigate modality imbalance in complex multimodal settings.
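
As a rough illustration of the composite loss described above, the following NumPy sketch adds a per-modality feature term and a per-modality logit term to a supervised loss on the fused prediction. The summary does not give the exact form of $\mathcal{L}_{\text{G2D}}$, so several choices here are assumptions rather than the authors' formulation: mean squared error for feature distillation, temperature-scaled KL divergence for logit distillation, and the weights `alpha` and `beta`. All function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    # Supervised loss on the fused (multimodal) student prediction.
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def logit_distill(student_logits, teacher_logits, temperature=1.0):
    # ASSUMPTION: KL(teacher || student) on temperature-softened distributions.
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(np.mean(kl)) * temperature ** 2

def feature_distill(student_feat, teacher_feat):
    # ASSUMPTION: mean squared error between student and teacher features.
    return float(np.mean((student_feat - teacher_feat) ** 2))

def g2d_style_loss(fused_logits, labels, student, teacher,
                   alpha=1.0, beta=1.0, temperature=1.0):
    # student / teacher: dict mapping modality name -> (features, logits).
    # alpha and beta are hypothetical weights, not values from the paper.
    loss = cross_entropy(fused_logits, labels)
    for m in student:
        s_feat, s_logits = student[m]
        t_feat, t_logits = teacher[m]
        loss += alpha * feature_distill(s_feat, t_feat)
        loss += beta * logit_distill(s_logits, t_logits, temperature)
    return loss
```

Under these assumptions, both distillation terms vanish when a student encoder reproduces its teacher's features and logits exactly, leaving only the supervised multimodal loss.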

Abstract

Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.
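
To make the idea of each modality taking a turn to lead training more concrete, here is a minimal sketch of one possible prioritization schedule. It is an assumption, not the schedule defined in the paper: modalities are ordered from weakest to strongest by a unimodal teacher confidence score, each one leads for a fixed number of epochs while the gradients of the other encoders are scaled to zero, and all encoders train jointly afterwards. The names `smp_order`, `modulation_coefficients`, and `epochs_per_phase` are hypothetical.

```python
def smp_order(confidence):
    # Order modalities weakest-first by unimodal teacher confidence.
    return sorted(confidence, key=confidence.get)

def modulation_coefficients(confidence, epoch, epochs_per_phase):
    # ASSUMPTION: one phase per modality, weakest first, then joint training.
    order = smp_order(confidence)
    phase = epoch // epochs_per_phase
    if phase >= len(order):
        # Every modality has had its turn: all encoders receive full gradients.
        return {m: 1.0 for m in order}
    lead = order[phase]
    # Only the leading modality's encoder receives gradients in this phase.
    return {m: 1.0 if m == lead else 0.0 for m in order}

# Example: the video teacher is less confident than the audio teacher,
# so video leads first. The scores are made-up illustrative values.
scores = {"audio": 0.8, "video": 0.3}
for epoch in (0, 5, 10):
    print(epoch, modulation_coefficients(scores, epoch, epochs_per_phase=5))
```

In a training loop, each coefficient would multiply the gradient of the corresponding encoder before the optimizer step; a softer variant could use small nonzero values instead of zeros.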

Paper Structure

This paper contains 52 sections, 8 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Performance of unimodal-only, unimodal in multimodal training, and purely multimodal models on the CREMA-D test set for multimodal classification. (a) The audio modality is indifferent to the training configuration; (b) the video modality is vulnerable to the audio modality in a multimodal setting; (c) the performance of the multimodal model is not optimal because of modality imbalance. G2D limits the optimization of the superior modality and enhances the video modality to optimize multimodal performance.
  • Figure 2: G2D consists of multiple, independently optimized unimodal teacher encoders and jointly optimized multimodal student encoders with all encoders generating feature representations and logits for each modality. The $\mathcal{L}_{\text{G2D}}$ Loss Module consists of student loss, feature distillation loss, and logit distillation loss. Confidence scores from the Scoring Module are used by the Sequential Modality Prioritization Module to generate dynamic modulation coefficients that adaptively adjust the gradients of each encoder to ensure balanced contributions.
  • Figure 3: Unimodal teacher confidence scores across multimodal datasets. Each line is the confidence of a specific modality (audio, visual, or text). Modality bias on all datasets, with higher scores for one modality, motivates our use of sequential modality prioritization.
  • Figure 4: Modality gap for AV-MNIST and UR-Funny datasets, with G2D increasing the modality separation compared to joint-training.
  • Figure 5: Alignment between unimodal and multimodal features in the audio encoder and the video encoder.
  • ...and 1 more figure