Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng

TL;DR

The paper identifies a gap in knowledge distillation (KD) for automatic speaker verification (ASV): non-target speaker information is underutilized when distilling from a large teacher to a smaller student. It proposes a decoupled knowledge distillation (DKD) framework that splits the distillation loss into target and non-target components and amplifies the non-target term with a hyperparameter. Experiments on the VoxCeleb datasets show that DKD with emphasized non-target speaker knowledge consistently improves over embedding-level and conventional label-level KD across three student architectures, with notable gains such as a 28.12% relative improvement in Vox1-O EER for the x-vector student and a low EER of 0.590% for CAM++. The results suggest that incorporating richer non-target speaker knowledge yields stronger, more data-efficient ASV models, generalizes across architectures, and lets more compact models approach or surpass teacher performance.

Abstract

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.
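
To make the idea of "disentangling and emphasizing" non-target probabilities concrete, the following is a minimal sketch, not the paper's implementation. It assumes the decomposition takes the usual decoupled-KD form: a KL term over the binary (target, non-target) split plus a KL term over the non-target probabilities renormalized among themselves, with the second term scaled by a weight `gamma`. The function names, the omission of a softmax temperature, and the exact placement of `gamma` are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decoupled_kd(teacher_logits, student_logits, target, gamma=1.0):
    """Return (target_term, non_target_term, weighted_total).

    Assumed form: total = target_term + gamma * non_target_term.
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)

    # Target term: KL over the binary split (target vs. all non-targets).
    b_t = [p_t[target], 1.0 - p_t[target]]
    b_s = [p_s[target], 1.0 - p_s[target]]
    target_term = kl(b_t, b_s)

    # Non-target term: softmax over non-target logits only, which equals
    # the non-target probabilities renormalized to sum to one.
    nt_t = softmax([z for i, z in enumerate(teacher_logits) if i != target])
    nt_s = softmax([z for i, z in enumerate(student_logits) if i != target])
    non_target_term = kl(nt_t, nt_s)

    return target_term, non_target_term, target_term + gamma * non_target_term


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5, -1.0]   # hypothetical logits, 4 training speakers
    student = [2.5, 1.5, 0.0, -0.5]
    print(decoupled_kd(teacher, student, target=0, gamma=2.0))
```

Under this form, conventional label-level KD is recovered when the non-target term is weighted by one minus the teacher's target probability. A confident teacher makes that weight small, which is one way to read the claim that conventional KD overlooks non-target knowledge; a fixed `gamma` removes that coupling.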

Paper Structure

This paper contains 14 sections, 8 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Vox1-O results (EER %) of x-vector model trained on a fixed number of utterances but varying numbers of speakers.
  • Figure 2: Our Decoupled Knowledge Distillation (DKD) with an emphasis on non-target speaker knowledge, compared with embedding-level knowledge distillation (using the cosine distance loss $\mathcal{L}_{\text{COS}}$) and conventional label-level knowledge distillation (using the Kullback–Leibler divergence loss $\mathcal{L}_{\text{KD}}$). $\mathcal{T}$, $\mathcal{S}$, $K$, and $\tau$ denote the teacher model, the student model, the number of training speakers, and the target speaker, respectively. $p_{i}$, $p_{\bar{\tau}}$, and $\hat{p}_i$ are defined in Eq. (\ref{eq:softmax}) and Eq. (\ref{eq:nc_softmax}). $\mathcal{L}_{\text{TSKD}}$, $\mathcal{L}_{\text{NSKD}}$, and $\gamma$ are defined in Eq. (\ref{eq:bf_final_kld}) and Eq. (\ref{eq:final_dkd}).