Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification
Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng
TL;DR
The paper identifies a gap in knowledge distillation (KD) for automatic speaker verification (ASV): non-target speaker information is underutilized when distilling from a large teacher to a smaller student. It proposes a decoupled knowledge distillation (DKD) framework that splits the distillation loss into target and non-target components and amplifies the non-target term with a hyperparameter, so that non-target speaker knowledge is explicitly emphasized. Experiments on the VoxCeleb datasets show that this non-target-emphasized DKD consistently improves over embedding-level and conventional label-level KD across three student architectures, with notable gains such as a 28.12% relative improvement in Vox1-O EER for the x-vector student and an EER of 0.590% for CAM++. The results suggest that incorporating richer non-target speaker knowledge yields stronger, more data-efficient ASV models, generalizes across architectures, and enables more compact models to approach or surpass teacher performance.
Abstract
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by enforcing consistency between large teacher networks and lightweight student networks at the embedding level or the label level. However, conventional label-level KD overlooks significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that using a larger number of non-target speakers during training improves the performance of automatic speaker verification models. Motivated by this finding on the importance of non-target speaker knowledge, we modify conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during distillation. The proposed method is applied to three different student model architectures and achieves an average EER improvement of 13.67% on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.
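To make the idea of disentangling and emphasizing non-target probabilities concrete, the following is a minimal sketch of a decoupled label-level distillation loss for a single utterance. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weights `alpha` and `beta`, the temperature handling, and the toy logits are all hypothetical. The target term compares the binary (target vs. all non-target) distributions of teacher and student, and the non-target term compares the distributions renormalized over non-target speakers only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of speaker logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def decoupled_kd_loss(teacher_logits, student_logits, target,
                      alpha=1.0, beta=2.0, temperature=1.0):
    """Return (total, target_term, non_target_term) for one utterance.

    alpha and beta are assumed hyperparameter names; beta > alpha
    emphasizes the non-target speaker knowledge.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Split each distribution into target and non-target parts.
    rest_teacher = [p for i, p in enumerate(p_teacher) if i != target]
    rest_student = [p for i, p in enumerate(p_student) if i != target]
    mass_teacher = sum(rest_teacher)
    mass_student = sum(rest_student)

    # Target term: binary distribution (target speaker vs. everyone else).
    target_term = kl_divergence(
        [p_teacher[target], mass_teacher],
        [p_student[target], mass_student],
    )

    # Non-target term: distributions renormalized over non-target speakers.
    non_target_term = kl_divergence(
        [p / mass_teacher for p in rest_teacher],
        [p / mass_student for p in rest_student],
    )

    # T^2 scaling is a common KD convention, assumed here.
    scale = temperature ** 2
    total = scale * (alpha * target_term + beta * non_target_term)
    return total, scale * target_term, scale * non_target_term


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5, -1.0]   # toy logits, speaker 0 is the target
    student = [2.0, 1.5, 0.0, -0.5]
    print(decoupled_kd_loss(teacher, student, target=0))
```

With a temperature of 1, the conventional KL-based distillation loss equals the target term plus the non-target term weighted by the teacher's total non-target probability. A confident teacher therefore suppresses the non-target term in conventional KD, which is why weighting it independently (here via `beta`) lets the student receive more of that knowledge.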
