Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

Zhenyu Wang, John H. L. Hansen

TL;DR

This work tackles the challenge of robust spoofing detection for ASV by proposing a unified framework that integrates a RawNet2-based encoder with simple and effective attention modules, a weighted additive angular margin loss to handle data imbalance, a meta-learning episodic optimization to improve generalization to unseen attacks, and a disentangled adversarial training strategy using an auxiliary BN. The combination yields notable improvements on the ASVspoof 2019 LA dataset, achieving a pooled EER of 0.87% and a min t-DCF of 0.0277, demonstrating enhanced robustness against diverse and unseen spoofing attacks. The contributions include systematic evaluation of attention modules, a novel WAAM loss for binary spoofing detection, a meta-learning framework with a relation-network-based similarity measure, and a disentangled adversarial training scheme that leverages both original and adversarial data. Collectively, these methods offer a practical path to more reliable voice authentication systems under adversarial spoofing conditions.

Abstract

Advances in automatic speaker verification (ASV) promote research into the formulation of spoofing detection systems for real-world applications. The performance of ASV systems can be degraded severely by multiple types of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins and impersonation, especially in the case of unseen synthetic spoofing attacks. A reliable and robust spoofing detection system can act as a security gate that filters out spoofing attacks before they reach the ASV system. In this study, a weighted additive angular margin loss is proposed to address the data imbalance issue, and different margins are assigned to improve generalization to unseen spoofing attacks. Meanwhile, we incorporate a meta-learning loss function to optimize differences between the embeddings of the support set and the query set in order to learn a spoofing-category-independent embedding space for utterances. Furthermore, we craft adversarial examples by adding imperceptible perturbations to spoofing speech as a data augmentation strategy; we then use an auxiliary batch normalization (BN) to guarantee that the corresponding normalization statistics are computed exclusively on the adversarial examples. Additionally, a simple attention module is integrated into the residual block to refine the feature extraction process. Evaluation results on the Logical Access (LA) track of the ASVspoof 2019 corpus confirm the effectiveness of the proposed approaches, with a pooled EER of 0.87% and a min t-DCF of 0.0277. These advancements offer effective options to reduce the impact of spoofing attacks on voice recognition/authentication systems.
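
To make the idea of a weighted additive angular margin loss more concrete, the following is a minimal sketch for the two-class (bona fide vs. spoof) case. It is an illustration under stated assumptions, not the authors' exact formulation: the function name `waam_loss`, the class order, and the default margins, class weights, and scale are all hypothetical placeholders chosen for the example. The sketch adds a class-specific margin to the angle of the target class and multiplies the resulting cross-entropy by a class-specific weight.

```python
import math


def waam_loss(cosines, label, margins=(0.2, 0.9), weights=(0.9, 0.1), scale=30.0):
    """Weighted additive angular margin loss for one utterance (illustrative).

    cosines: cosine similarity between the embedding and each of the two
             class weight vectors (hypothetical order: class 0, class 1).
    label:   index of the true class (0 or 1).
    margins: per-class angular margins in radians (placeholder values).
    weights: per-class loss weights to offset class imbalance (placeholders).
    scale:   scaling factor applied to the logits (placeholder value).
    """
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if k == label:
            # Target class: add the class-specific margin to the angle.
            # (theta + margin is assumed to stay below pi in this sketch.)
            theta = math.acos(c)
            logits.append(scale * math.cos(theta + margins[k]))
        else:
            logits.append(scale * c)

    # Numerically stable log-sum-exp for the softmax denominator.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))

    # Class-weighted negative log-likelihood of the true class.
    return -weights[label] * (logits[label] - log_sum_exp)


if __name__ == "__main__":
    # An embedding that is only slightly closer to class 0 than to class 1.
    print(waam_loss((0.3, 0.1), label=0))
```

In this sketch a larger margin shrinks the target logit and so raises the loss for the same embedding, which pushes the model toward a wider angular gap between classes; the per-class weight simply rescales the contribution of each class to the total loss.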

Paper Structure

This paper contains 30 sections, 19 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Study overview.
  • Figure 2: Joint optimization scheme. All spoofing samples and embeddings are color-coded to represent different types of spoofing attacks, while genuine speech is gray. A similarity score in green denotes a match ($r_{i,j}=1$); a score in red denotes a mismatch ($r_{i,j}=0$).
  • Figure 3: Alternate data flow options between architectures with conventional BN (a) and with auxiliary BN (b).
  • Figure 4: Distribution of ASV scores and countermeasure scores.
  • Figure 5: Normalised ASV-constrained t-DCFs plot for the baseline CM system and proposed CM system on ASVspoof 2019 LA track.
  • ...and 1 more figure