Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection

Cunhang Fan, Mingming Ding, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv

TL;DR

This work tackles degraded synthetic speech detection performance in realistic noisy conditions by introducing DKDSSD, a dual-branch framework that couples a clean teacher with a noisy student. It combines an interactive fusion module to adaptively merge denoised and original noisy features with an online knowledge-distillation scheme and joint training to align the student’s decisions with the teacher’s. Key contributions include the interactive fusion design, response-based distillation, and a joint optimization strategy, validated across multiple ASVspoof datasets and noise conditions. The results demonstrate improved noise robustness and strong cross-dataset generalization, offering practical benefits for robust SSD deployment in real-world, noisy environments.

Abstract

Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. In practice, however, noise interference is usually present and causes significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel data flow with a clean teacher branch and a noisy student branch is designed, and an interactive fusion module and a response-based teacher-student paradigm are proposed to guide training on noisy data from both the data-distribution and decision-making perspectives. In the noisy student branch, speech enhancement is applied first for denoising, to reduce the interference of strong noise. The proposed interactive fusion combines denoised and noisy features to mitigate the impact of speech distortion and to keep the fused features consistent with the data distribution of the clean branch. The teacher-student paradigm maps the student's decision space to the teacher's decision space, so that noisy speech behaves similarly to clean speech. Additionally, a joint training method optimizes both branches together, aiming at a global optimum. Experimental results on multiple datasets demonstrate that the proposed method performs effectively in noisy environments and maintains its performance in cross-dataset experiments. Source code is available at https://github.com/fchest/DKDSSD.
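
To make the response-based teacher-student idea concrete, the following is a minimal sketch of a jointly trained soft-target loss. It is not taken from the authors' repository. It assumes a conventional formulation: both branches get a cross-entropy loss against the ground-truth label, and the student is pulled toward the teacher's temperature-softened outputs through a KL divergence. The function names, the `alpha` weight, the `temperature**2` scaling, and the default values are all assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Classification loss against the ground-truth class index."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint_loss(teacher_logits, student_logits, label, temperature=2.0, alpha=0.5):
    """Hypothetical joint objective: two classification losses plus a soft-target term.

    alpha and the temperature**2 scaling are assumed conventions,
    not values confirmed by the paper.
    """
    ce_teacher = cross_entropy(teacher_logits, label)  # clean branch
    ce_student = cross_entropy(student_logits, label)  # noisy branch
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    return ce_teacher + ce_student + alpha * distill
```

When the student's logits match the teacher's, the soft-target term vanishes and only the two classification losses remain. The term grows as the noisy branch's decisions drift away from the clean branch's.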

Paper Structure

This paper contains 33 sections, 11 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: The structure of (a) traditional synthetic speech detection, trained on clean speech or on noisy speech via multi-condition training (MCT); (b) cascade or joint training with a speech enhancement front end for noisy scenarios; (c) the overall joint training structure of the proposed dual-branch knowledge distillation synthetic speech detection system. The teacher model requires clean speech as input and is trained in parallel with the student model.
  • Figure 2: The structure and details of the dual-branch knowledge distillation synthetic speech detection system. Green boxes represent interactive fusion, and red boxes represent knowledge distillation. The teacher model classifier has the same structure as the student model. $C$ denotes the concatenation operation, and $\otimes$ denotes element-wise multiplication.
  • Figure 3: Schematic diagram of the structure of the teacher-student model. The teacher and student are trained in parallel: each has its own classification loss computed against the ground-truth label, and a soft-target loss between the teacher and the student is computed at the same time. The degree of softening is controlled by the temperature hyperparameter $T$.
  • Figure 4: The t-SNE plots of all baseline models on the ASVspoof 2019 unseen test set (left column) and ASVspoof 2021 LA test set (right column).
  • Figure 5: A feature visualization of LA_E_2178426_snr15_babble.wav. This utterance is generated from LA_E_2178426.wav in ASVspoof 2019 by superimposing babble noise at an SNR of 15 dB. From top to bottom: the logarithmic magnitude spectrum of the noisy speech, that of the enhanced speech, the fusion weight matrix, and the mean value of the fused 16-channel features.
  • ...and 1 more figures
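
The Figure 5 caption describes a noisy utterance made by superimposing babble noise on clean speech at an SNR of 15 dB. As an illustration of what that involves, here is a minimal sketch of mixing at a target SNR, using the standard definition SNR = 10·log10(P_speech / P_noise). It is not the authors' data-preparation script. The function names are hypothetical, and the signals are plain Python lists of samples rather than audio files.

```python
import math

def mean_power(samples):
    """Average power of a signal given as a list of float samples."""
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper: assumes the noise is at least as long as the
    speech and simply truncates it to the same length.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # Solve 10*log10(P_speech / (gain^2 * P_noise)) = snr_db for gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

A call such as `mix_at_snr(clean, babble, 15)` would correspond to the `snr15_babble` condition in the file name, with `clean` and `babble` standing in for the decoded waveforms.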