Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization

Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou

TL;DR

This work tackles multi-modal domain generalization (MMDG) by addressing the modality-imbalance problem inherent in weight-averaging approaches. It introduces Modality-Balanced Collaborative Distillation (MBCD), a framework combining Adaptive Modality Dropout, Gradient Consistency, and EMA-based Cross-Modal Distillation to promote balanced, coordinated learning across modalities and steer optimization toward flatter minima. Empirical results on EPIC-Kitchens and HAC show that MBCD consistently outperforms state-of-the-art baselines in both multi-source and single-source DG settings, with notable gains in unseen-domain robustness and strong in-domain performance when leveraging all available modalities. The work advances practical multi-modal DG by ensuring stronger cross-modal fusion and more reliable generalization under distribution shifts, with implications for real-world systems relying on diverse sensing streams.

Abstract

Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steers convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
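The last step of the abstract, a weight-averaged teacher distilling fused knowledge into each uni-modal branch, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the decay value, the temperature, and the choice of a KL-divergence loss averaged over modalities are all assumptions made for illustration, and weights are represented as flat lists of floats to keep the example dependency-free.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """One exponential-moving-average step of teacher weights toward the student.

    `decay` is a hypothetical value; the source does not state one.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_distillation_loss(teacher_fused_logits, student_unimodal_logits,
                                  temperature=2.0):
    """Average KL between the teacher's fused prediction and each uni-modal branch.

    `student_unimodal_logits` maps a modality name to that branch's logits.
    Using KL with a shared temperature is an assumption, not the paper's
    stated loss.
    """
    p_teacher = softmax(teacher_fused_logits, temperature)
    losses = [
        kl_divergence(p_teacher, softmax(logits, temperature))
        for logits in student_unimodal_logits.values()
    ]
    return sum(losses) / len(losses)
```

In this sketch the teacher is never trained by gradient descent; it only tracks the student through `ema_update`, and the loss is zero exactly when every uni-modal branch reproduces the teacher's fused distribution.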

Paper Structure

This paper contains 19 sections, 16 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Illustration of EMA's limitations in MMDG. (a) Comparison on the EPIC-Kitchens test set: the uni-modal models are trained from scratch, while the multi-modal model uses post-hoc classifier training. (b) Performance under varying modality shifts induced by Gaussian noise with different variances (shift levels); performance drops sharply when the dominant modality (video) is perturbed. (c) Loss landscape visualization: EMA converges to sharp, biased minima under distributional shifts, while our MBCD yields flatter minima and improved robustness.
  • Figure 2: Overall framework of our MBCD. Our model first performs a uni-modal objective-guided inner-loop update to enhance modality-specific encoders. The updated modality representations are then fused via adaptive modality dropout to mitigate modality imbalance. An EMA-based teacher further guides both uni-modal and fused predictions, promoting stable and modality-balanced learning.
  • Figure 3: Comparison of flatness for different methods on EPIC-Kitchens and HAC.
  • Figure 4: Comparison of modality-wise and fused accuracies on EPIC-Kitchens and HAC.
  • Figure 5: Validation and test accuracy curves on EPIC-Kitchens (D2, D3 $\rightarrow$ D1) across all modalities.
  • ...and 1 more figure
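The modality-shift probe described in the Figure 1(b) caption, where Gaussian noise of varying variance is added to one modality, can be sketched as follows. The function name, the dictionary-of-lists feature layout, and the reading of "shift level" as the noise variance are assumptions for illustration; the caption does not specify where in the pipeline the noise is injected.

```python
import math
import random


def perturb_modality(features, modality, shift_level, seed=0):
    """Return a copy of `features` with Gaussian noise added to one modality.

    `features` maps a modality name to a list of floats (hypothetical layout).
    `shift_level` is treated as the noise variance, so the standard deviation
    is its square root. Other modalities are copied unchanged.
    """
    rng = random.Random(seed)  # local generator keeps the probe reproducible
    std = math.sqrt(shift_level)
    shifted = {name: list(values) for name, values in features.items()}
    shifted[modality] = [v + rng.gauss(0.0, std) for v in features[modality]]
    return shifted


# Example: perturb only the video stream at increasing shift levels.
clean = {"video": [0.2, -0.1, 0.7], "audio": [0.5, 0.5, -0.3]}
sweep = {level: perturb_modality(clean, "video", level) for level in (0.0, 0.5, 1.0)}
```

Evaluating a trained model on each entry of `sweep` would give one point per shift level, which is the kind of curve the caption describes.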