G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation
Mohammed Rakib, Arunkumar Bagavathi
TL;DR
We address modality imbalance in multimodal learning by introducing Gradient-Guided Distillation (G$^{2}$D), which distills from multiple unimodal teachers to a multimodal student using a composite loss $\mathcal{L}_{\text{G2D}}$ that combines supervised multimodal loss with per-modality feature and logit distillation. A Sequential Modality Prioritization (SMP) schedule, guided by unimodal confidence scores $\rho_t^m$, dynamically prioritizes weaker modalities to reduce dominance and improve balanced optimization. Across six real-world datasets and both classification and regression tasks, G$^{2}$D with SMP outperforms state-of-the-art baselines, improves feature-space alignment between unimodal and multimodal encoders, and yields more robust performance with minimal inference overhead since teachers are precomputed. The approach is compatible with various fusion strategies and KD formulations, offering a practical path toward more balanced and reliable multimodal systems. Overall, G$^{2}$D demonstrates that coordinated distillation and adaptive gradient scheduling can meaningfully mitigate modality imbalance in complex multimodal settings.
Abstract
Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representations and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique to ensure that each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities during training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.
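To make the ideas in the summary concrete, here is a minimal Python sketch of a loss that adds per-modality feature and logit distillation to a supervised multimodal term, plus a helper that picks the modality to prioritize. It is an illustration under stated assumptions, not the authors' implementation: the exact form of $\mathcal{L}_{\text{G2D}}$ and of the SMP rule is not given here, so the use of MSE for feature distillation, temperature-scaled KL divergence for logit distillation, the weights `alpha` and `beta`, the `temperature` value, and the "lowest confidence first" rule in `weakest_modality` are all hypothetical choices. The repository linked above is the authoritative reference.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Supervised loss on the fused (multimodal) logits."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def composite_loss(fused_logits, label, student, teacher,
                   alpha=1.0, beta=1.0, temperature=2.0):
    """Supervised loss plus per-modality feature and logit distillation.

    `student` and `teacher` map a modality name to a dict with keys
    "feat" and "logits". The weights alpha and beta, the temperature,
    and the choice of MSE and KL are assumptions made for illustration.
    """
    loss = cross_entropy(fused_logits, label)
    for modality in teacher:
        # Feature distillation: pull the student's unimodal features
        # toward those of the corresponding unimodal teacher.
        loss += alpha * mse(student[modality]["feat"],
                            teacher[modality]["feat"])
        # Logit distillation: match softened class distributions.
        p_teacher = softmax(teacher[modality]["logits"], temperature)
        p_student = softmax(student[modality]["logits"], temperature)
        loss += beta * kl_divergence(p_teacher, p_student)
    return loss


def weakest_modality(confidence):
    """Hypothetical prioritization rule: pick the modality whose
    unimodal confidence score is currently lowest."""
    return min(confidence, key=confidence.get)
```

For example, `weakest_modality({"audio": 0.9, "video": 0.4})` selects `"video"`, and when the student's unimodal features and logits already match the teachers', `composite_loss` reduces to the plain supervised cross-entropy on the fused logits.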
