Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning

Haibin Wu, Xu Li, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

TL;DR

This paper tackles the vulnerability of ASV systems to adversarial attacks by proposing a self-supervised learning-based reformer (SSLR) framework that operates without knowledge of attack algorithms. It introduces two complementary defenses: adversarial perturbation purification via cascaded SSLR models and adversarial perturbation detection via score-variation analysis across SSLR cascades, supported by a formal evaluation framework. Empirical results show substantial reductions in adversarial success rates and competitive preservation of genuine-sample performance, with ASV fine-tuning further mitigating collateral impact. The work demonstrates practical robustness gains on VoxCeleb data and provides benchmarks to guide future adversarial defenses in speaker verification.

Abstract

Previous work has shown that automatic speaker verification (ASV) is seriously vulnerable to malicious spoofing attacks such as replay, synthetic speech, and the recently emerged adversarial attacks. Great effort has been dedicated to defending ASV against replay and synthetic speech; however, only a few approaches have been explored to deal with adversarial attacks. All existing approaches to adversarial attacks on ASV require knowledge of how the adversarial samples are generated, but it is impractical for defenders to know the exact attack algorithms applied by in-the-wild attackers. This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms. Inspired by self-supervised learning models (SSLMs), which can alleviate superficial noise in the inputs and reconstruct clean samples from interrupted ones, this work regards adversarial perturbations as one kind of noise and conducts adversarial defense for ASV with SSLMs. Specifically, we propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection. Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%. Moreover, since there is no common metric for evaluating adversarial defense performance for ASV, this work also formalizes evaluation metrics for adversarial defense that take both purification-based and detection-based approaches into account. We sincerely encourage future work to benchmark approaches using the proposed evaluation framework.
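The two defense perspectives can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_reformer` (a moving average over time) merely stands in for a trained SSLR model, `make_toy_scorer` (frame-wise cosine similarity against an enrollment vector) stands in for a real ASV back end, and both the averaged absolute score difference and the threshold of 0.01 are assumptions about how score variation across the cascade might be measured.

```python
from math import sqrt
from typing import Callable, List

Features = List[List[float]]  # frames x feature dimensions


def cascade_outputs(x: Features, reformer: Callable[[Features], Features],
                    k: int) -> List[Features]:
    """Pass x through the reformer k times, keeping every intermediate output."""
    outputs, current = [], x
    for _ in range(k):
        current = reformer(current)
        outputs.append(current)
    return outputs  # outputs[-1] is the fully purified input


def score_variation(x: Features, scorer: Callable[[Features], float],
                    reformer: Callable[[Features], Features], k: int) -> float:
    """Mean absolute change in ASV score across the k cascade outputs (assumed metric)."""
    base = scorer(x)
    purified = cascade_outputs(x, reformer, k)
    return sum(abs(scorer(p) - base) for p in purified) / k


def is_adversarial(x: Features, scorer: Callable[[Features], float],
                   reformer: Callable[[Features], Features],
                   k: int, threshold: float) -> bool:
    """Flag the input when its score moves more than the (hypothetical) threshold."""
    return score_variation(x, scorer, reformer, k) > threshold


def toy_reformer(x: Features) -> Features:
    """Hypothetical stand-in for an SSLR model: 3-frame moving average over time."""
    out = []
    for t in range(len(x)):
        window = x[max(0, t - 1): t + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def make_toy_scorer(enroll: List[float]) -> Callable[[Features], float]:
    """Hypothetical stand-in for ASV scoring: mean frame-wise cosine similarity."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(p * q for p, q in zip(a, b))
        return dot / (sqrt(sum(p * p for p in a)) * sqrt(sum(q * q for q in b)))

    def scorer(x: Features) -> float:
        return sum(cosine(frame, enroll) for frame in x) / len(x)

    return scorer


scorer = make_toy_scorer([1.0, 2.0])
clean = [[1.0, 2.0] for _ in range(6)]
# Toy "perturbed" input: the clean frames plus an alternating offset.
perturbed = [[1.5, 1.5] if t % 2 == 0 else [0.5, 2.5] for t in range(6)]

clean_var = score_variation(clean, scorer, toy_reformer, k=2)
pert_var = score_variation(perturbed, scorer, toy_reformer, k=2)
```

In this toy setting the clean input passes through the reformer unchanged, so its score does not move, while the perturbed input is smoothed and its score shifts. A real system would tune the threshold on held-out genuine samples.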

Paper Structure

This paper contains 24 sections, 29 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: The illustration of inputs with various masking strategies. The masked part is highlighted by an orange block. (a) is the original MFCC, and (b), (c), (d) are the MFCC modified by channel masking, time masking and magnitude alteration, respectively.
  • Figure 2: (a) Illustration of SSLR model training. The gray pixels in the altered frames (the orange block in (a)) denote masked regions. The green and blue frames in the predicted frames (the yellow block in (a)) are the reconstructed frames for time masking and frequency masking, respectively. (b) Adversarial defense by cascaded SSLR models. The cascaded SSLR models in (b) are used only during inference. (c) Automatic speaker verification. Note that all $\boldsymbol{X}$, $\boldsymbol{\tilde{X}}$, $\boldsymbol{\tilde{X}'}$ and $\boldsymbol{\delta}$ in the figure are matrices that represent acoustic feature sequences.
  • Figure 3: Flow of detection. All $\boldsymbol{X}$ and $\{\boldsymbol{\tilde{X}}_i \mid i = 1, \ldots, K\}$ in the figure are matrices that represent acoustic feature sequences of an utterance.
  • Figure 4: ASV system performance with different numbers of cascaded SSLR models. (a) and (b) show the AdvFAR and AdvFRR of the r-vector system, respectively. (c) and (d) show the AdvFAR and AdvFRR of the x-vector system, respectively.
  • Figure 5: Purification performance under different attacks: the AdvFAR of the r-vector system under three attacks.
  • ...and 3 more figures
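
The three input alterations named in the Figure 1 caption (channel masking, time masking and magnitude alteration) can be sketched as follows. This is an illustration only: masking by setting values to zero, using Gaussian noise for magnitude alteration, and the specific positions and widths are all assumptions, not details taken from the paper.

```python
import random
from typing import List

Features = List[List[float]]  # frames x channels, e.g. an MFCC sequence


def time_mask(x: Features, start: int, width: int) -> Features:
    """Zero out `width` consecutive frames beginning at frame `start`."""
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(x)
    ]


def channel_mask(x: Features, start: int, width: int) -> Features:
    """Zero out `width` consecutive channels in every frame."""
    return [
        [0.0 if start <= c < start + width else v for c, v in enumerate(frame)]
        for frame in x
    ]


def magnitude_alter(x: Features, noise_std: float, rng: random.Random) -> Features:
    """Add zero-mean Gaussian noise to every entry (assumed form of alteration)."""
    return [[v + rng.gauss(0.0, noise_std) for v in frame] for frame in x]


# Toy 6-frame, 4-channel input standing in for real MFCC features.
mfcc = [[float(t + c) + 1.0 for c in range(4)] for t in range(6)]

masked_time = time_mask(mfcc, start=2, width=2)
masked_chan = channel_mask(mfcc, start=1, width=2)
noisy = magnitude_alter(mfcc, noise_std=0.1, rng=random.Random(0))
```

Each function returns a new feature matrix and leaves the input untouched, so the original frames remain available as the reconstruction target during training.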