Frequency-mix Knowledge Distillation for Fake Speech Detection

Cunhang Fan; Shunbo Dong; Jun Xue; Yujie Chen; Jiangyan Yi; Zhao Lv

TL;DR

This paper proposes Frequency-mix (Freqmix), a novel data augmentation (DA) method, and introduces Freqmix knowledge distillation (FKD) to enhance the model's information extraction and generalization abilities. The approach achieves state-of-the-art results on the ASVspoof 2021 LA dataset.

Abstract

In telephony scenarios, the fake speech detection (FSD) task of combating speech spoofing attacks is challenging. Data augmentation (DA) methods are considered an effective means of addressing the FSD task in telephony scenarios, and are typically divided into time-domain and frequency-domain stages. While each has its advantages, both can result in information loss. To tackle this issue, we propose a novel DA method, Frequency-mix (Freqmix), and introduce Freqmix knowledge distillation (FKD) to enhance the model's information extraction and generalization abilities. Specifically, we use Freqmix-enhanced data as input for the teacher model, while the student model's input undergoes a time-domain DA method. We use a multi-level feature distillation approach to restore information and improve the model's generalization capabilities. Our approach achieves state-of-the-art results on the ASVspoof 2021 LA dataset, showing a 31% improvement over the baseline, and performs competitively on the ASVspoof 2021 DF dataset.
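The multi-level feature distillation mentioned in the abstract can be illustrated with a minimal sketch. The paper's exact loss is not given in this summary, so everything below is an assumption: the per-level mean squared error, the optional per-level weights, and the function names `mse` and `multi_level_distill_loss` are hypothetical and chosen only for illustration. Teacher features are treated as fixed targets, reflecting the statement that the teacher's parameters remain unchanged while the student is trained.

```python
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_distill_loss(
    teacher_feats: Sequence[Sequence[float]],
    student_feats: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-level MSE between teacher and student features.

    Assumption: one flattened feature vector per network level. The teacher
    features act as constant targets; only the student would be updated.
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must expose the same number of levels")
    if weights is None:
        weights = [1.0] * len(teacher_feats)  # assumed: equal weight per level
    return sum(
        w * mse(t, s) for w, t, s in zip(weights, teacher_feats, student_feats)
    )


# Toy features from two levels (hypothetical values).
teacher = [[1.0, 2.0], [0.0, 0.0, 0.0]]  # e.g. from Freqmix-augmented input
student = [[1.0, 4.0], [3.0, 0.0, 0.0]]  # e.g. from time-domain-augmented input
loss = multi_level_distill_loss(teacher, student)
```

In a full training setup this term would presumably be combined with the student's classification loss; how the two are balanced is not stated here and is left open.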

Paper Structure (12 sections, 3 equations, 2 figures, 4 tables)

Introduction
Proposed Method
Freqmix Data Augmentation
Knowledge Distillation
Experiments and Results
Datasets
Experimental Setup
Results
Ablation Study
Performance Comparison With Other Systems
Conclusion
Acknowledgements

Figures (2)

Figure 1: Illustration of the proposed Freqmix knowledge distillation (FKD) method for FSD in telephony scenarios. The student model and the teacher model both adopt the MPIF-Res2Net architecture with an identical number of parameters. During the training of the student model, the parameters of the teacher model remain unchanged.
Figure 2: Illustration of the cut-and-paste operation among the samples in the same batch.
