I$^2$RF-TFCKD: Intra-Inter Representation Fusion with Time-Frequency Calibration Knowledge Distillation for Speech Enhancement
Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao
TL;DR
This work tackles the trade-off between speech enhancement (SE) performance and model efficiency by introducing I$^2$RF-TFCKD, a knowledge-distillation framework that fuses intra-set and inter-set representations while applying time-frequency calibration. The approach uses a teacher-student setup built on the dual-path dilated convolutional recurrent network (DPDCRN) backbone, with recursive residual fusion to generate representative features and time-frequency-aware calibration weights to allocate distillation emphasis across layers. Experiments on DNS and L3DAS23 show that intra-inter distillation with time-frequency calibration consistently improves a lightweight student, outperforming several state-of-the-art distillation strategies and achieving competitive results with far fewer parameters and FLOPs. The method offers a practical route to deploying high-quality SE on edge devices while maintaining real-time performance and strong perceptual metrics.
Abstract
In this paper, we propose an intra-inter representation fusion knowledge distillation (KD) framework with time-frequency calibration (I$^2$RF-TFCKD) for speech enhancement (SE), which achieves distillation through the fusion of multi-layer teacher-student feature flows. Unlike previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while promoting global knowledge flow. First, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are matched pairwise for calibrated distillation. We then generate a representative feature from each correlated set through residual fusion, forming a fused feature set that enables inter-set knowledge interaction. Second, we propose a multi-layer interactive distillation scheme based on dual-stream time-frequency cross-calibration, which computes teacher-student similarity calibration weights separately in the time and frequency domains and then applies cross-weighting, enabling a refined allocation of distillation contributions across layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN), which ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$RF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently improves the performance of the low-complexity student model and outperforms other distillation schemes.
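
As a rough illustration of the dual-stream cross-calibration idea, the dependency-free Python sketch below computes pairwise teacher-student similarity weights in a time stream and a frequency stream, then lets each stream's weights scale the other stream's distillation distance. This is one possible reading of the abstract, not the authors' implementation: the mean pooling, cosine similarity, softmax normalization, MSE distance, the particular form of cross-weighting, and the name `cross_calibrated_loss` are all assumptions made for illustration. Each layer feature is simplified to a single time-by-frequency map, with teacher and student maps assumed to share the same shape.

```python
import math

def pool_time(feat):
    """Average over frequency: (T x F) map -> length-T vector."""
    return [sum(row) / len(row) for row in feat]

def pool_freq(feat):
    """Average over time: (T x F) map -> length-F vector."""
    return [sum(col) / len(col) for col in zip(*feat)]

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_calibrated_loss(teacher, student):
    """Hypothetical cross-calibrated distillation loss for one correlated set.

    teacher, student: lists of (T x F) maps, one per layer (assumed layout).
    Returns (loss, time_weights, freq_weights); each weight row holds one
    student layer's weights over all teacher layers.
    """
    t_time = [pool_time(f) for f in teacher]
    t_freq = [pool_freq(f) for f in teacher]
    loss, time_weights, freq_weights = 0.0, [], []
    for feat in student:
        s_time, s_freq = pool_time(feat), pool_freq(feat)
        # Similarity-based calibration weights, computed per stream.
        w_time = softmax([cosine(s_time, t) for t in t_time])
        w_freq = softmax([cosine(s_freq, t) for t in t_freq])
        # Assumed cross-weighting: time weights scale the frequency-stream
        # distance and frequency weights scale the time-stream distance.
        for j in range(len(teacher)):
            loss += w_time[j] * mse(s_freq, t_freq[j])
            loss += w_freq[j] * mse(s_time, t_time[j])
        time_weights.append(w_time)
        freq_weights.append(w_freq)
    return loss / len(student), time_weights, freq_weights

# Toy example: 2 layers, each a 3-frame x 2-bin map (made-up values).
teacher = [[[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]],
           [[0.5, 0.1], [0.4, 0.2], [0.3, 0.3]]]
student = [[[0.9, 2.1], [1.8, 3.2], [3.1, 3.9]],
           [[0.6, 0.2], [0.3, 0.2], [0.2, 0.4]]]
loss, w_t, w_f = cross_calibrated_loss(teacher, student)
print(loss, w_t, w_f)
```

In this sketch, a student layer that resembles a given teacher layer along the time axis places more of its frequency-stream distillation emphasis on that teacher layer, and vice versa. A practical version would operate on batched multi-channel tensors and would likely need learned projections to align teacher and student feature dimensions.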
