Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

Duc-Tuan Truong; Ruijie Tao; Jia Qi Yip; Kong Aik Lee; Eng Siong Chng

TL;DR

The paper identifies a gap in knowledge distillation (KD) for automatic speaker verification (ASV): non-target speaker information is underutilized when distilling from a large teacher to a smaller student. It proposes a decoupled knowledge distillation (DKD) framework that explicitly emphasizes non-target speaker knowledge by splitting the label-level distillation loss into a target-speaker term and a non-target-speaker term, then amplifying the non-target term with a hyperparameter. Experiments on the VoxCeleb datasets show that DKD with emphasized non-target speaker knowledge consistently improves over embedding-level and conventional label-level KD across three student architectures, with notable results such as a 28.12% relative improvement in Vox1-O EER for the x-vector student and an EER of 0.590% for CAM++. The results suggest that incorporating richer non-target speaker knowledge yields stronger, more data-efficient ASV models, generalizes across architectures, and lets more compact models approach or surpass teacher performance.

Abstract

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.
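The disentangling step described above can be illustrated with a minimal sketch. It assumes the standard decoupled KD decomposition, in which the label-level KL divergence splits into a binary target-vs-rest term and a term over the renormalized non-target probabilities, and it assumes the two terms are combined as `tskd + gamma * nskd`. The function names, the example logits, and the exact weighting form are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of speaker logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def decoupled_kd_loss(teacher_logits, student_logits, target, gamma=1.0):
    """Return (total, tskd, nskd) for one utterance.

    Assumed weighting: total = tskd + gamma * nskd (hypothetical form).
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)

    # Probability mass assigned to all non-target speakers.
    rest_teacher = sum(p for i, p in enumerate(p_teacher) if i != target)
    rest_student = sum(p for i, p in enumerate(p_student) if i != target)

    # Target-speaker term: binary distribution (target vs. everyone else).
    tskd = kl_divergence(
        [p_teacher[target], rest_teacher],
        [p_student[target], rest_student],
    )

    # Non-target term: probabilities renormalized among non-target speakers.
    hat_teacher = [p / rest_teacher for i, p in enumerate(p_teacher) if i != target]
    hat_student = [p / rest_student for i, p in enumerate(p_student) if i != target]
    nskd = kl_divergence(hat_teacher, hat_student)

    return tskd + gamma * nskd, tskd, nskd


# Hypothetical logits for K = 4 training speakers; speaker 0 is the target.
teacher = [4.0, 1.0, 0.5, 0.2]
student = [2.0, 1.5, 0.3, 0.1]
total, tskd, nskd = decoupled_kd_loss(teacher, student, target=0, gamma=2.0)
```

Under this decomposition, conventional KD equals `tskd + (1 - p_target_teacher) * nskd`, so a confident teacher shrinks the non-target term. Weighting it with an independent `gamma` is one way to keep that knowledge from being suppressed.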

Paper Structure (14 sections, 8 equations, 2 figures, 2 tables)

Introduction
Methodology
The impact of non-target speakers for ASV
Rethinking conventional label-level KD
Decoupled Knowledge Distillation with an emphasis on non-target speaker knowledge
Experiments Setup
Dataset
Model
Training and Evaluation
Results and Analysis
Results of the proposed method
Ablation Study: The impact of $\mathcal{L}_\text{NSKD}$
Conclusion
Acknowledgement

Figures (2)

Figure 1: Vox1-O results (EER %) of x-vector model trained on a fixed number of utterances but varying numbers of speakers.
Figure 2: Our Decoupled Knowledge Distillation (DKD) with an emphasis on non-target speaker knowledge in comparison with the embedding-level knowledge distillation (using cosine distance loss $\mathcal{L}_{\text{COS}}$) and the conventional label-level knowledge distillation (using Kullback–Leibler divergence loss $\mathcal{L}_{\text{KD}}$). $\mathcal{T}$, $\mathcal{S}$, $K$, and $\tau$ denote the teacher model, the student model, the number of training speakers, and the target speaker, respectively. $p_{i}$, $p_{\bar{\tau}}$, and $\hat{p}_i$ are defined in the paper's equations labeled eq:softmax and eq:nc_softmax. $\mathcal{L}_{\text{TSKD}}$, $\mathcal{L}_{\text{NSKD}}$, and $\gamma$ are defined in the equations labeled eq:bf_final_kld and eq:final_dkd.
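
For orientation, the symbols in the Figure 2 caption can be read against the standard decoupled KD decomposition. The block below is a sketch under that assumption, since the paper's own equations are not reproduced on this page. The logits $z_i$, the superscripts marking teacher and student, and the placement of $\gamma$ in the last line are assumptions.

```latex
% Softmax over K training speakers, non-target mass, and renormalized non-target probabilities
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad
p_{\bar{\tau}} = \sum_{i \neq \tau} p_i = 1 - p_{\tau}, \qquad
\hat{p}_i = \frac{p_i}{p_{\bar{\tau}}} \quad (i \neq \tau)

% Conventional label-level KD splits into a target term and a down-weighted non-target term
\mathcal{L}_{\text{KD}}
  = \mathrm{KL}\!\left(p^{\mathcal{T}} \,\|\, p^{\mathcal{S}}\right)
  = \mathcal{L}_{\text{TSKD}} + p^{\mathcal{T}}_{\bar{\tau}} \, \mathcal{L}_{\text{NSKD}}

% Assumed emphasized form: the teacher-dependent weight is replaced by a free hyperparameter
\mathcal{L}_{\text{DKD}} = \mathcal{L}_{\text{TSKD}} + \gamma \, \mathcal{L}_{\text{NSKD}}
```

Here $\mathcal{L}_{\text{TSKD}}$ is the KL divergence between the binary distributions $[p_{\tau}, p_{\bar{\tau}}]$ of teacher and student, and $\mathcal{L}_{\text{NSKD}}$ is the KL divergence between their $\hat{p}_i$ distributions.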
