Distilling Cross-Modal Knowledge via Feature Disentanglement
Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang
TL;DR
The paper tackles cross-modal knowledge distillation by introducing a frequency-domain feature decoupling framework that separates modality-generic and modality-specific information. Low-frequency components receive strong alignment via $L_{\text{low}}$ (MSE), while high-frequency components receive weaker alignment via $L_{\text{high}}$ (LogMSE), complemented by scale-consistency losses and a shared classifier to align feature spaces. Across classification and segmentation benchmarks, the approach consistently outperforms traditional KD and state-of-the-art CMKD methods, demonstrating robust bidirectional transfer and improved cross-modal representations. The findings emphasize the importance of frequency-aware disentanglement and distribution-aware alignment for effective cross-modal knowledge transfer with practical implications for multimodal learning systems.
Abstract
Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistent representations across modalities make knowledge transfer difficult. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method that decouples and balances knowledge transfer across modalities by leveraging frequency-domain features. We observe that low-frequency features are highly consistent across modalities, whereas high-frequency features show extremely low cross-modal similarity. Accordingly, we apply distinct losses to the two groups: strong alignment in the low-frequency domain and relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show that our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
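To make the split between strong and relaxed alignment concrete, the following is a minimal NumPy sketch, not the released implementation. Several details are assumptions made for illustration: the frequency split uses an FFT low-pass mask along the feature axis with a hypothetical `cutoff` parameter, the relaxed loss is taken to be the mean of log(1 + squared error), and the weights `w_low` and `w_high` are placeholders. The scale consistency loss and the shared classifier are left out.

```python
import numpy as np


def split_frequencies(feat, cutoff):
    """Split features of shape (batch, dim) into low- and high-frequency parts.

    Assumption: the split is an FFT low-pass mask along the last axis that
    keeps the first `cutoff` frequency bins as the low-frequency part.
    """
    spec = np.fft.rfft(feat, axis=-1)
    low_spec = spec.copy()
    low_spec[..., cutoff:] = 0  # zero out every bin at or above the cutoff
    low = np.fft.irfft(low_spec, n=feat.shape[-1], axis=-1)
    high = feat - low  # the residual carries the high-frequency content
    return low, high


def mse(a, b):
    """Plain mean squared error: the strong alignment term."""
    return float(np.mean((a - b) ** 2))


def log_mse(a, b):
    """Assumed form of the relaxed term: mean of log(1 + squared error).

    Because log(1 + x) <= x for x >= 0, large mismatches are penalised
    less heavily than under plain MSE.
    """
    return float(np.mean(np.log1p((a - b) ** 2)))


def frequency_decoupled_loss(student, teacher, cutoff=4, w_low=1.0, w_high=1.0):
    """Strong alignment on low frequencies, relaxed alignment on high ones."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    return w_low * mse(s_low, t_low) + w_high * log_mse(s_high, t_high)


# Toy usage with random stand-ins for student and teacher features.
rng = np.random.default_rng(0)
student_feat = rng.normal(size=(2, 16))
teacher_feat = rng.normal(size=(2, 16))
loss = frequency_decoupled_loss(student_feat, teacher_feat)
```

Under these assumptions the high-frequency penalty grows only logarithmically with the mismatch, so modality-specific detail is pulled toward the teacher gently, while the low-frequency term is matched with the full quadratic penalty.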
