Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Xuechen Liu; Md Sahidullah; Kong Aik Lee; Tomi Kinnunen

TL;DR

This work tackles spoofing threats in speaker verification by proposing Generalizing SASV (G-SASV), a backend-centric approach that removes the need for a separate spoof countermeasure (CM) at authentication time. It uses limited CM training data together with domain adaptation (DA) and multi-task learning, which inject spoof-related information into the embedding space, to produce a three-class posterior without running a CM during testing. On the ASVspoof 2019 logical access (LA) dataset, combining a simple neural backend with spoof embeddings and meta attributes improves on statistical ASV backends by up to 36.2% in equal error rate (EER) on the joint condition and up to 49.8% on the spoofed condition. These results suggest that DA and spoof-informed multi-task training can make a single ASV system more robust to spoofing, with potential benefits for computational cost and generalization in real-world deployments.

Abstract

It is now well known that automatic speaker verification (ASV) systems can be spoofed by various types of adversaries. The usual approach to protecting ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module that classifies speech input as either a bonafide or a spoofed utterance. Nevertheless, such a design requires additional computation and effort at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort impostors (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protection and more economical computation. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without involving a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.
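To make the idea of a single spoof-aware backend more concrete, the following is a minimal Python sketch of how a three-class posterior (bonafide target, bonafide non-target, spoof) could be turned into one accept/reject score and evaluated with equal error rates. It is an illustration under stated assumptions, not the authors' implementation: the class ordering, the log-ratio scoring rule, the helper names (`softmax`, `sasv_score`, `equal_error_rate`), and the toy logits are all hypothetical, and the paper's own decision policy may differ.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sasv_score(posteriors, eps=1e-12):
    """Assumed scoring rule: log-ratio of the target posterior against
    the combined non-target and spoof posteriors.
    Assumed class order: [target, non-target, spoof]."""
    p_accept = posteriors[..., 0]
    p_reject = posteriors[..., 1] + posteriors[..., 2]
    return np.log(p_accept + eps) - np.log(p_reject + eps)

def equal_error_rate(accept_scores, reject_scores):
    """Simple threshold sweep: return the error rate at the threshold
    where false acceptance and false rejection rates are closest."""
    thresholds = np.sort(np.concatenate([accept_scores, reject_scores]))
    far = np.array([(reject_scores >= t).mean() for t in thresholds])
    frr = np.array([(accept_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Toy backend outputs (made-up logits, one row per trial).
logits = np.array([
    [4.0, 0.0, 0.0],   # bonafide target
    [3.0, 1.0, 0.0],   # bonafide target
    [0.0, 4.0, 0.0],   # bonafide non-target
    [1.0, 3.0, 0.0],   # bonafide non-target
    [0.0, 0.0, 4.0],   # spoof, clearly detected
    [3.0, 0.0, 0.5],   # spoof that resembles the target
])
labels = np.array(["tar", "tar", "non", "non", "spf", "spf"])

post = softmax(logits)
scores = sasv_score(post)

tar = scores[labels == "tar"]
eer_joint = equal_error_rate(tar, scores[labels != "tar"])   # non-targets and spoofs
eer_spoof = equal_error_rate(tar, scores[labels == "spf"])   # spoofs only
print(f"joint EER: {eer_joint:.3f}, spoof EER: {eer_spoof:.3f}")
```

In this sketch the joint condition pools non-target and spoof trials as rejections, while the spoofed condition uses spoof trials only, mirroring the two evaluation conditions named in the abstract.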

Paper Structure (40 sections, 8 equations, 9 figures, 8 tables)

Introduction
Problem Formulation
Spoof Countermeasure
Joint Optimization
Spoof-Aware Speaker Verification
Generalizing Speaker Verification
The source and type of the input
The spoof-aware ASV classifier
The decision policy
Domain Adaptation for Generalizing ASV Against Spoofing
Network-wise Adaptation
Structural Transformation on ReLU
Spoof Integration for Generalizing ASV Against Spoofing
Type of spoof features
Synthetic spoof embeddings
...and 25 more sections

Figures (9)

Figure 1: A demonstration of existing approaches to improve spoofing robustness of ASV systems ((a) to (c)), with the proposed approach ((d)). Modules bearing "Frontend" in their name extract embeddings, whereas modules labeled with "scoring" provide decision scores based on the input(s). Modules featuring a lock symbol are sourced from pre-trained models and, as such, remain non-trainable during the training process.
Figure 2: The framework of generalizing SASV against spoofing attacks. The sub-figures demonstrate G-SASV at the conceptual and practical levels, respectively. The right-hand side outlines the baseline system. The three posterior probabilities represent the three classes respectively: bonafide target $C_\text{tar}$, bonafide non-target/impostor $C_\text{non}$, and spoof target $C_\text{spf}$. $\Theta$ denotes the set of parameters of the spoof-aware ASV classifier. $\mathcal{L}$ is the training loss function. The decision-making module is used to perform analysis analogous to that of anti-spoofing and joint optimization systems. The dashed line indicates the step that is discarded at the evaluation stage. The dotted lines indicate steps that are executed only during the evaluation stage.
Figure 3: The process of generating the meta attribute vector $\mathbf{\phi}_\text{attr}$ according to the type of attack.
Figure 4: The multi-task learning schemes for the separate regression branch. The dashed lines indicate steps that are discarded at evaluation. The target distribution of the regression branch $\mathbf{\phi}_\text{reg}$ can be either $\mathbf{\phi}_{\text{spoof}}$ or $[\mathbf{\phi}_{\text{spoof}}, \mathbf{\phi}_{\text{attr}}]$.
Figure 5: The multi-task learning schemes use auxiliary branches with regression and classification with meta attributes as labels ($\mathbf{\phi}_\text{attr}$). The dashed lines indicate steps which are discarded at evaluation.
...and 4 more figures
