Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model

Massimiliano Todisco; Michele Panariello; Xin Wang; Héctor Delgado; Kong Aik Lee; Nicholas Evans

TL;DR

The paper addresses the vulnerability of automatic speaker verification (ASV) systems to adversarial spoofing by introducing Malacopula, a neural-based generalised Hammerstein post-processing filter. Malacopula applies a multi-branch, non-linear processing chain that perturbs the amplitude, phase, and frequency content of spoofed speech. Its coefficients are optimized to minimize the cosine distance between the speaker embeddings of the filtered spoofed utterance and a bona fide utterance, $\min_{\mathbf{c}^{(s,a)}_{K,L}} \left[ 1 - \mathrm{CS}\left( f_A\left( MC^{(s,a)}_{K,L}(\mathbf{x}) \right), f_A(\mathbf{y}) \right) \right]$, where $\mathrm{CS}$ denotes cosine similarity and $f_A$ is the speaker embedding extractor. The best filter is then selected with a Wasserstein-distance criterion computed on score distributions obtained from a second embedding extractor. Experiments on three ASV systems (CAM++, ECAPA, ERes2Net) using the ASVspoof 2019 LA dataset show that Malacopula increases vulnerability to spoofing across architectures, with more pronounced effects for certain attacks. Speech quality degrades (lower MOS), however, and existing detectors such as AASIST can still detect many Malacopula-perturbed inputs under controlled conditions. The work highlights the need for stronger defenses, more realistic evaluations in unconstrained environments, and further exploration of non-linear adversarial attacks in speech security. It also demonstrates the feasibility of cross-system attack transfer and underlines the practical risk of adversarial perturbations in real-world ASV deployments.
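To make the objective concrete, the following is a minimal sketch of the quantity being minimised, one minus the cosine similarity between two embedding vectors. The function names and the toy vectors are assumptions for illustration only. In the paper the embeddings come from the extractor $f_A$ and the loss is minimised with respect to the filter coefficients, which this sketch does not attempt.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity CS(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance_loss(emb_filtered_spoof, emb_bona_fide):
    """Value of 1 - CS(...); lower means the two embeddings point the same way."""
    return 1.0 - cosine_similarity(emb_filtered_spoof, emb_bona_fide)


# Toy 2-D vectors (hypothetical values, not real speaker embeddings).
target = [1.0, 0.0]
aligned = cosine_distance_loss([2.0, 0.0], target)     # same direction as target
orthogonal = cosine_distance_loss([0.0, 3.0], target)  # perpendicular to target
print(aligned, orthogonal)
```

The loss depends only on the angle between the embeddings, not on their magnitudes, which is why the scaled vector `[2.0, 0.0]` gives the same loss as the target itself would.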

Abstract

We present Malacopula, a neural-based generalised Hammerstein model designed to introduce adversarial perturbations to spoofed speech utterances so that they better deceive automatic speaker verification (ASV) systems. Using non-linear processes to modify speech utterances, Malacopula enhances the effectiveness of spoofing attacks. The model comprises parallel branches of polynomial functions followed by linear time-invariant filters. The adversarial optimisation procedure acts to minimise the cosine distance between speaker embeddings extracted from spoofed and bona fide utterances. Experiments, performed using three recent ASV systems and the ASVspoof 2019 dataset, show that Malacopula increases vulnerabilities by a substantial margin. However, speech quality is reduced and attacks can be detected effectively under controlled conditions. The findings emphasise the need to identify new vulnerabilities and design defences to protect ASV systems from adversarial attacks in the wild.
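The structure described above, parallel polynomial branches each followed by a linear time-invariant filter, can be illustrated with a small dependency-free sketch. It assumes the k-th branch raises the input to the power k and applies a causal FIR filter, and that the branch outputs are summed. The names `fir_filter` and `hammerstein` and the toy coefficients are hypothetical, and any further processing used in the actual Malacopula filter is not modelled here.

```python
def fir_filter(signal, taps):
    """Causal FIR (linear time-invariant) filter; output has the input's length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for i, c in enumerate(taps):
            if n - i >= 0:  # samples before the start are treated as zero
                acc += c * signal[n - i]
        out.append(acc)
    return out


def hammerstein(signal, branch_taps):
    """Sum over branches k = 1..K of FIR_k applied to signal**k."""
    out = [0.0] * len(signal)
    for k, taps in enumerate(branch_taps, start=1):
        powered = [s ** k for s in signal]     # static polynomial non-linearity
        filtered = fir_filter(powered, taps)   # linear filter for this branch
        out = [o + f for o, f in zip(out, filtered)]
    return out


# Toy input and coefficients (hypothetical): K = 2 branches, 2 taps each.
x = [0.5, -0.25, 1.0]
coeffs = [[1.0, 0.0], [0.5, 0.5]]
y = hammerstein(x, coeffs)
print(y)
```

With a single branch the model reduces to an ordinary linear filter. The higher-order branches are what let it introduce non-linear distortion.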

Paper Structure (13 sections, 4 equations, 6 figures)

Introduction
Literature Review
Generalised Hammerstein Model
Malacopula
Malacopula filter architecture
Adversarial Optimisation Procedure
Experimental Setup
Database, protocols and filter optimisation
Implementation
Metrics
Experimental Results
Conclusions
Acknowledgements

Figures (6)

Figure 1: Malacopula filter architecture based on the generalised Hammerstein model. The blue box represents the linear component, while the orange dashed box represents the non-linear filter components.
Figure 2: During training, Malacopula filters are optimised with the speaker embedding extractor $f_A(\cdot)$ as denoted by Equation \ref{eq:objective}. To ensure generalisation across different speakers, the best Malacopula filter is selected using another speaker embedding extractor $f_B(\cdot)$. The selection is based on the minimum Wasserstein distance between the following two score distributions: (i) the cosine distance between spoofed utterances processed by the Malacopula filter $MC(\mathbf{X})$ and bona fide enrolment utterances $\mathbf{y}$, and (ii) the cosine distance between bona fide target utterances $\mathbf{Z}$ and bona fide enrolment utterances $\mathbf{y}$. If multiple enrolment utterances are available, we use the average enrolment embedding.
Figure 3: Pooled spf-EER for baseline spoof and Malacopula filtered spoof attacks for four different ASV systems.
Figure 4: spf-EER per attack for baseline spoof and Malacopula 257-5 filtered spoof attacks on three ASV systems.
Figure 5: spf-EER per attack for baseline spoof and Malacopula 257-5 filtered spoof attacks on three ASV systems.
...and 1 more figure
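The filter selection step described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the scores are taken as already computed, the two score sets are assumed to have the same number of samples (in which case the one-dimensional Wasserstein-1 distance is the mean absolute difference of the sorted values), and the filter names and score values are hypothetical.

```python
def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal-size samples")
    pairs = zip(sorted(u), sorted(v))
    return sum(abs(a - b) for a, b in pairs) / len(u)


def select_filter(candidate_scores, target_scores):
    """Return the candidate whose score distribution is closest to the target's."""
    return min(
        candidate_scores,
        key=lambda name: wasserstein_1d(candidate_scores[name], target_scores),
    )


# Hypothetical cosine-distance scores against the enrolment embedding.
target_scores = [0.25, 0.5, 0.75]      # bona fide target vs. enrolment
candidate_scores = {
    "filter_a": [1.0, 1.25, 1.5],      # filtered spoof vs. enrolment
    "filter_b": [0.5, 0.25, 1.0],
}
best = select_filter(candidate_scores, target_scores)
print(best)
```

Comparing whole score distributions rather than single utterances favours filters whose spoofed scores resemble bona fide scores across the set. For unequal sample sizes a general routine such as `scipy.stats.wasserstein_distance` would be needed instead of this simplified version.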
