I$^2$RF-TFCKD: Intra-Inter Representation Fusion with Time-Frequency Calibration Knowledge Distillation for Speech Enhancement

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

TL;DR

This work tackles the trade-off between speech enhancement (SE) performance and model efficiency by introducing I$^2$RF-TFCKD, a knowledge-distillation framework that fuses intra-set and inter-set representations while applying time-frequency calibration. The approach uses a teacher-student setup built on the DPDCRN backbone: recursive residual fusion generates representative features, and time-frequency-aware calibration weights allocate distillation emphasis across layers. Experiments on DNS and L3DAS23 show that intra-inter distillation with time-frequency calibration consistently improves a lightweight student, outperforming several state-of-the-art distillation strategies and achieving competitive results with far fewer parameters and FLOPs. The method offers a practical path to deploying high-quality SE on edge devices while maintaining real-time performance and strong perceptual metrics.

Abstract

In this paper, we propose an intra-inter representation fusion knowledge distillation (KD) framework with time-frequency calibration (I$^2$RF-TFCKD) for speech enhancement (SE), which achieves distillation through the fusion of multi-layer teacher-student feature flows. Unlike previous distillation strategies for SE, the proposed framework fully utilizes the time-frequency differential information of speech while promoting global knowledge flow. First, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. We then generate a representative feature from each correlated set through residual fusion, forming a fused feature set that enables inter-set knowledge interaction. Second, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates teacher-student similarity calibration weights in the time and frequency domains separately and performs cross-weighting, enabling refined allocation of distillation contributions across layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN), which ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$RF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.
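The cross-calibration step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: features are plain 2-D lists indexed as [frame][frequency bin], similarity is cosine similarity, weights come from a softmax over the matched layer pairs, and "cross-weighting" is read as a renormalized product of the time-stream and frequency-stream weights. All function names are hypothetical.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def freq_stream_similarity(teacher, student):
    # Frame-wise: compare the spectra of matching frames, then average over frames.
    return sum(cosine(t, s) for t, s in zip(teacher, student)) / len(teacher)

def time_stream_similarity(teacher, student):
    # Compare each frequency bin's trajectory over time, then average over bins.
    t_cols, s_cols = list(zip(*teacher)), list(zip(*student))
    return sum(cosine(t, s) for t, s in zip(t_cols, s_cols)) / len(t_cols)

def calibration_weights(teacher_layers, student_layers):
    """One weight per matched teacher-student layer pair (assumed scheme)."""
    pairs = list(zip(teacher_layers, student_layers))
    w_time = softmax([time_stream_similarity(t, s) for t, s in pairs])
    w_freq = softmax([freq_stream_similarity(t, s) for t, s in pairs])
    # Assumption: cross-weighting modeled as a renormalized element-wise product.
    crossed = [a * b for a, b in zip(w_time, w_freq)]
    total = sum(crossed)
    return [c / total for c in crossed]

# Two matched layer pairs: the first student layer copies its teacher,
# the second is orthogonal to its teacher in both streams.
teacher_layers = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
student_layers = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
weights = calibration_weights(teacher_layers, student_layers)
```

Under these assumptions the well-aligned pair receives the larger weight. Whether the framework emphasizes similar or dissimilar pairs is not stated in the abstract, so the direction of the weighting here is also only illustrative.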

Paper Structure

This paper contains 30 sections, 11 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Backbone network architecture of the teacher model.
  • Figure 2: Overall architecture of the I$^2$RF-TFCKD framework. We present the detailed process of intra-inter set distillation. Fig. 2(a) shows the time-frequency cross-calibration knowledge transfer and the recursive feature fusion across different layers within a single correlated set. Fig. 2(b) demonstrates inter-set distillation achieved via the fused feature set among various correlated sets. Fig. 2(c) visualizes the process of generating fused features by utilizing the features of the current layer and the inherited recursive features.
  • Figure 3: (a) Using single-level teacher knowledge to guide one-level learning of the student. (b) Using multiple layers of the teacher to supervise one layer in the student.
  • Figure 4: Similarity mapping of time and frequency flows. The self-similarity matrices for distillation are calculated in two flows: the time domain and the frequency domain. The distribution of features along the time axis will affect the similarity computation in the time flow, whereas the frequency domain processing operates frame-wise.
  • Figure 5: Training curve trends on the DNS validation set.
  • ...and 3 more figures
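
The two similarity flows in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: features are nested lists indexed as [channel][frame][frequency bin], the time flow yields one frame-by-frame matrix, the frequency flow yields one bin-by-bin matrix per frame, and cosine similarity with a mean-squared-error comparison is an assumed choice. The helper names are hypothetical.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)

def gram(vectors):
    """Pairwise cosine-similarity matrix of a list of vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def time_flow_similarity(feat):
    # Each frame is described by all of its channel and frequency values,
    # giving a single T x T matrix that depends on the whole time axis.
    n_ch, n_frames = len(feat), len(feat[0])
    frames = [[x for c in range(n_ch) for x in feat[c][t]] for t in range(n_frames)]
    return gram(frames)

def freq_flow_similarity(feat):
    # Frame-wise: within each frame, every frequency bin is described by its
    # channel vector, giving one F x F matrix per frame.
    n_ch, n_frames, n_bins = len(feat), len(feat[0]), len(feat[0][0])
    return [
        gram([[feat[c][t][f] for c in range(n_ch)] for f in range(n_bins)])
        for t in range(n_frames)
    ]

def flatten(x):
    """Flatten arbitrarily nested lists of numbers."""
    return [v for item in x for v in (flatten(item) if isinstance(item, list) else [item])]

def mse(a, b):
    """Mean squared error between two identically shaped nested lists."""
    fa, fb = flatten(a), flatten(b)
    return sum((p - q) ** 2 for p, q in zip(fa, fb)) / len(fa)

# Teacher with 3 channels, student with 2 channels; both have 2 frames and 3 bins.
teacher = [[[float(c + t + f + 1) for f in range(3)] for t in range(2)] for c in range(3)]
student = [[[float((c + 1) * (t + 1) + f) for f in range(3)] for t in range(2)] for c in range(2)]

t_time, s_time = time_flow_similarity(teacher), time_flow_similarity(student)
t_freq, s_freq = freq_flow_similarity(teacher), freq_flow_similarity(student)
loss = mse(t_time, s_time) + mse(t_freq, s_freq)
```

One reason to compare self-similarity matrices rather than raw features is that their shapes depend only on the number of frames and frequency bins, so a teacher and a student with different channel counts can be matched without an extra projection layer.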