ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification

Amro Asali, Yehuda Ben-Shimol, Itshak Lapidot

TL;DR

This work targets spoofing-robust speaker verification by introducing Score-Aware Gated Attention (SAGA), which modulates speaker embeddings with countermeasure (CM) scores. It systematically explores early, late, full, and fused integration strategies, and develops alternating training regimes (ATMM) along with an enhanced variant, ELEAT (early and late integration with evading alternating training), to improve generalization to unseen attacks. The proposed ELEAT-SAGA, which leverages early CM features and a bypass mechanism, achieves state-of-the-art SASV performance on ASVspoof 2019 LA (SASV-EER ≈ 1.22%) and strong results on SpoofCeleb, while reducing training time. The results indicate that score-based gating and carefully designed training procedures can substantially improve spoofing resilience in SASV systems, with practical implications for deployable, secure biometric verification.

Abstract

Spoofing-robust automatic speaker verification (SASV) seeks to build automatic speaker verification systems that are robust against both zero-effort impostor attacks and sophisticated spoofing techniques such as voice conversion (VC) and text-to-speech (TTS). In this work, we propose SASV-SAGA, a novel SASV architecture built on score-aware gated attention (SAGA), which dynamically modulates speaker embeddings based on countermeasure (CM) scores. We integrate speaker embeddings and CM scores from pre-trained ECAPA-TDNN and AASIST models, respectively, and explore several integration strategies, including early, late, and full integration. We further introduce alternating training for multi-module (ATMM) and a refined variant, evading alternating training (EAT). Experimental results on the ASVspoof 2019 Logical Access (LA) and SpoofCeleb datasets demonstrate significant improvements over baselines, achieving a spoofing-aware speaker verification equal error rate (SASV-EER) of 1.22% and a minimum normalized agnostic detection cost function (min a-DCF) of 0.0304 on the ASVspoof 2019 evaluation set. These results confirm the effectiveness of score-aware attention mechanisms and alternating training strategies in enhancing the robustness of SASV systems.
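
The idea of letting a CM score modulate a speaker embedding can be illustrated with a minimal sketch. This is not the authors' SAGA implementation: the function names, the sigmoid gate, the `scale` and `bias` parameters, the toy embeddings, and the convention that a higher CM score means "more likely bona fide" are all assumptions made for illustration. The sketch squashes the CM score into a gate in (0, 1), scales the unit-length test embedding by that gate, and scores it against the enrolment embedding, so a spoofed utterance is scored low even when its speaker embedding closely matches the target.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def l2_normalize(vec):
    """Return the vector scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def gated_sasv_score(enrol_emb, test_emb, cm_score, scale=1.0, bias=0.0):
    """Hypothetical score-gated verification score.

    Assumption: a higher cm_score means the test utterance is more
    likely bona fide. The gate scales the test embedding, so the
    result equals gate * cosine(enrol_emb, test_emb).
    """
    gate = sigmoid(scale * cm_score + bias)
    enrol = l2_normalize(enrol_emb)
    gated_test = [gate * x for x in l2_normalize(test_emb)]
    return sum(e * t for e, t in zip(enrol, gated_test))


# Toy embeddings (illustrative values, not real model outputs).
enrol = [0.2, 0.9, -0.4]
test = [0.25, 0.85, -0.35]

bona_fide = gated_sasv_score(enrol, test, cm_score=4.0)
spoofed = gated_sasv_score(enrol, test, cm_score=-4.0)
print(f"bona fide trial: {bona_fide:.3f}")
print(f"spoofed trial:   {spoofed:.3f}")
```

In this toy setting both trials share the same speaker embedding, so only the CM score separates them. A learned gate, as the abstract describes, would replace the fixed sigmoid used here.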

Paper Structure (35 sections, 6 equations, 4 figures, 8 tables, 2 algorithms)

Figures (4)

  • Figure 1: Diagram of the eFusion SASV system. Dashed arrows denote operations exclusive to the training stage.
  • Figure 2: Diagram of the proposed SAGA system, illustrating various CM score integration strategies. Dashed arrows denote operations exclusive to the training stage.
  • Figure 3: Diagram of the proposed ELEAT-SAGA system, illustrating branched feature extraction and the CM bypass, which is available only during ASV module training.
  • Figure 4: UMAP visualization of embeddings from the ASVspoof 2019 Train, Dev, and Eval sets. Each row corresponds to one dataset split, while the columns show embeddings extracted from AASIST (left), ELEAT-CM (middle), and ELEAT-SASV (right).

Theorems & Definitions (1)

  • Definition 2.1: SASV