Distilling Cross-Modal Knowledge via Feature Disentanglement

Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang

TL;DR

The paper tackles cross-modal knowledge distillation by introducing a frequency-domain feature decoupling framework that separates modality-generic and modality-specific information. Low-frequency components receive strong alignment via $L_{\text{low}}$ (MSE), while high-frequency components receive weaker alignment via $L_{\text{high}}$ (logMSE), complemented by scale-consistency losses and a shared classifier to align feature spaces. Across classification and segmentation benchmarks, the approach consistently outperforms traditional KD and state-of-the-art CMKD methods, demonstrating robust bidirectional transfer and improved cross-modal representations. The findings emphasize the importance of frequency-aware disentanglement and distribution-aware alignment for effective cross-modal knowledge transfer with practical implications for multimodal learning systems.

Abstract

Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistent representations across modalities make knowledge transfer difficult. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observe that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.
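To make the decoupling step concrete, the following minimal sketch splits a single 1-D feature vector into a low-frequency and a high-frequency part using a discrete Fourier transform. It is an illustration only: the transform, the hard cutoff mask, the `cutoff` parameter, and the helper names (`dft`, `idft_real`, `split_frequencies`) are assumptions made for this example, and the released code may decompose features differently.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spec):
    """Inverse DFT; keeps the real part (valid for conjugate-symmetric spectra)."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_frequencies(x, cutoff):
    """Split x into low- and high-frequency parts.

    `cutoff` is a hypothetical hyperparameter: bins whose absolute
    frequency index is <= cutoff count as low frequency.
    """
    n = len(x)
    spec = dft(x)
    # Bins k and n - k carry the same absolute frequency, so mask them together.
    low_spec = [c if min(k, n - k) <= cutoff else 0j for k, c in enumerate(spec)]
    high_spec = [c - l for c, l in zip(spec, low_spec)]
    return idft_real(low_spec), idft_real(high_spec)


# Toy feature vector (made-up values, not taken from the paper).
feature = [0.0, 1.0, 0.5, 2.0, 1.5, 3.0, 2.5, 4.0]
low, high = split_frequencies(feature, cutoff=1)

# The two parts sum back to the original feature.
reconstructed = [l + h for l, h in zip(low, high)]
```

Because the mask is complementary, the two components always add back up to the input, so applying a strong loss to `low` and a relaxed loss to `high` partitions the distillation signal rather than duplicating it.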

Paper Structure

This paper contains 31 sections, 10 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The comparison of feature mean value differences across modalities.
  • Figure 2: Framework of our method. We decouple the features of different modalities in the frequency domain into high-frequency and low-frequency components. For low-frequency features, MSE loss is applied, while logMSE loss is used for high-frequency features. Additionally, we ensure consistency in feature scale and feature space across modalities through feature normalization and alignment modules.
  • Figure 3: Comparison of value and gradient between MSE and logMSE losses.
  • Figure 4: t-SNE visualization comparison between the conventional feature distillation method and our proposed approach. We visualize the features of different modalities on the CREMAD test set. T represents the teacher modality.
  • Figure 5: Sensitivity study of the high- and low-frequency loss weights.
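The contrast that Figure 3 describes between MSE and logMSE can be sketched numerically for a single element-wise difference d between teacher and student features. This page does not state the paper's exact logMSE formula, so the form log(1 + d^2) used below is an assumption chosen for illustration, as are the function names.

```python
import math


def mse_value(d):
    """Squared error for one element-wise difference d."""
    return d * d


def mse_grad(d):
    """Derivative of d^2 with respect to d."""
    return 2.0 * d


def log_mse_value(d):
    """Assumed logMSE form: log(1 + d^2). Not confirmed by the source."""
    return math.log1p(d * d)


def log_mse_grad(d):
    """Derivative of log(1 + d^2) with respect to d."""
    return 2.0 * d / (1.0 + d * d)


# Compare both losses for a small, a medium, and a large difference.
for d in (0.1, 1.0, 5.0):
    print(
        f"d={d}: "
        f"mse={mse_value(d):.4f} (grad {mse_grad(d):.4f}), "
        f"logmse={log_mse_value(d):.4f} (grad {log_mse_grad(d):.4f})"
    )
```

Under this assumed form, the two losses behave almost identically for small differences, while for large differences the logMSE gradient shrinks instead of growing linearly. That matches the role of a relaxed alignment term for high-frequency features, where cross-modal similarity is reported to be low.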