Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization
Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou
TL;DR
This work tackles multi-modal domain generalization (MMDG) by addressing the modality-imbalance problem inherent in weight-averaging approaches. It introduces Modality-Balanced Collaborative Distillation (MBCD), a framework combining Adaptive Modality Dropout, Gradient Consistency, and EMA-based Cross-Modal Distillation to promote balanced, coordinated learning across modalities and steer optimization toward flatter minima. Empirical results on EPIC-Kitchens and HAC show that MBCD consistently outperforms state-of-the-art baselines in both multi-source and single-source DG settings, with notable gains in unseen-domain robustness and strong in-domain performance when leveraging all available modalities. The work advances practical multi-modal DG by ensuring stronger cross-modal fusion and more reliable generalization under distribution shifts, with implications for real-world systems relying on diverse sensing streams.
Abstract
Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities. This hinders effective modality fusion and skews the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
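To make the teacher-student mechanism concrete, the following is a minimal sketch of an EMA weight-averaged teacher and a cross-modal distillation loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the decay of 0.99, the temperature of 2.0, the use of KL divergence as the distillation objective, and the representation of weights as flat lists are all hypothetical choices made for this example.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """Move teacher weights toward the student with an exponential moving average.

    The decay value is an assumed hyperparameter, not taken from the paper.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_modal_distillation_loss(teacher_fused_logits, student_unimodal_logits,
                                  temperature=2.0):
    """Average KL from the teacher's fused prediction to each uni-modal student branch.

    student_unimodal_logits maps a modality name to that branch's logits.
    Using KL with a softened softmax is an assumption for illustration.
    """
    p_teacher = softmax(teacher_fused_logits, temperature)
    total = 0.0
    for logits in student_unimodal_logits.values():
        total += kl_divergence(p_teacher, softmax(logits, temperature))
    return total / len(student_unimodal_logits)

# Usage: one training step's bookkeeping with toy numbers.
teacher_weights = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
loss = cross_modal_distillation_loss(
    teacher_fused_logits=[2.0, 0.5, -1.0],
    student_unimodal_logits={"video": [1.5, 0.7, -0.5], "audio": [0.2, 0.1, 0.0]},
)
```

In this sketch the loss is zero when every uni-modal branch already matches the teacher's fused prediction and grows as a branch drifts away from it, which is the sense in which fused knowledge is transferred to each branch.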
