Diffusion-Based Adversarial Purification for Speaker Verification

Yibo Bai, Xiao-Lei Zhang, Xuelong Li

TL;DR

The paper addresses the vulnerability of automatic speaker verification (ASV) systems to imperceptible adversarial perturbations. It introduces Diffusion-Based Adversarial Purification (DAP), a purification framework that uses a conditional denoising diffusion probabilistic model to reconstruct clean audio before ASV inference. The approach defines a purification operator that preserves the ASV score while removing perturbations, and leverages forward diffusion and a learned reverse denoising process to recover clean speech from adversarial inputs. Experiments on VoxCeleb data show that DAP achieves state-of-the-art purification with minimal distortion to genuine speech, improving defense performance across attack types and ASV backbones.

Abstract

Automatic speaker verification (ASV) based on deep learning is vulnerable to adversarial attacks, a recent type of attack that injects imperceptible perturbations into audio signals to make ASV produce wrong decisions. This poses a significant threat to the security and reliability of ASV systems. To address this issue, we propose a Diffusion-Based Adversarial Purification (DAP) method that enhances the robustness of ASV systems against such attacks. Our method leverages a conditional denoising diffusion probabilistic model to purify adversarial examples and mitigate the impact of perturbations. DAP first introduces controlled noise into adversarial examples and then performs a reverse denoising process to reconstruct clean audio. Experimental results demonstrate the efficacy of the proposed DAP in enhancing the security of ASV while minimizing the distortion of the purified audio signals.
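
The two-stage procedure described in the abstract (inject controlled noise, then run a reverse denoising process) can be illustrated with a minimal NumPy sketch of standard DDPM sampling. This is not the authors' implementation: the linear noise schedule, the step count, the starting step `t_star`, and the `eps_model` callable (a stand-in for a trained noise-prediction network, which in the paper is conditional) are all assumptions made for illustration.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule (assumed values; the paper's schedule may differ)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def diffuse(x, t_star, alpha_bars, rng):
    """Forward process in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a_bar = alpha_bars[t_star - 1]
    eps = rng.standard_normal(x.shape)
    return np.sqrt(a_bar) * x + np.sqrt(1.0 - a_bar) * eps


def purify(x_adv, eps_model, t_star, schedule, rng):
    """Noise the adversarial waveform up to step t_star, then denoise back to step 0.

    eps_model(x_t, t) is a hypothetical callable that returns the predicted noise;
    any conditioning signal is assumed to be captured inside that callable.
    """
    betas, alphas, alpha_bars = schedule
    x_t = diffuse(x_adv, t_star, alpha_bars, rng)
    for t in range(t_star, 0, -1):
        i = t - 1  # array index for step t
        eps_hat = eps_model(x_t, t)
        # Posterior mean of the standard DDPM reverse step.
        mean = (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        if t > 1:
            # Add sampling noise on every step except the last one.
            x_t = mean + np.sqrt(betas[i]) * rng.standard_normal(x_t.shape)
        else:
            x_t = mean
    return x_t
```

In this sketch, a larger `t_star` injects more noise, which would be expected to wash out more of the perturbation at the cost of more distortion to the genuine speech; how that trade-off is set in the paper is not stated in this summary.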

Paper Structure

This paper contains 20 sections, 9 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: A speaker verification pipeline with the DAP method. The adversarial example is first fed into a diffusion model placed in front of the ASV system. The diffusion model applies a forward "diffusion" process to the adversarial input and then reverses this process to reconstruct the original clean audio. Finally, the ASV system produces the correct verification outcome.
  • Figure 2: A comparison between the original audio, its adversarial example, and the outputs of different defenders. The genuine example is id10270/5r0dWxy17C8/00024.wav from VoxCeleb1. Because the TERA method focuses on feature-level purification, it is not included in this figure.
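
The pipeline in Figure 1 places the purifier in front of the ASV system, so the verification back end itself is unchanged. The sketch below shows that placement only. The function names, the `embed` and `purifier` callables, the cosine scoring back end, and the threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (assumed scoring back end)."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))


def verify(test_wav, enroll_wav, embed, purifier, threshold):
    """Purify the test utterance first, then score it against the enrollment utterance.

    embed:    hypothetical callable mapping a waveform to a speaker embedding.
    purifier: hypothetical callable mapping a waveform to a purified waveform.
    """
    purified = purifier(test_wav)
    score = cosine_score(embed(purified), embed(enroll_wav))
    return score, score >= threshold
```

Because the purifier is just a waveform-to-waveform callable here, the same wrapper could in principle sit in front of different ASV backbones without retraining them.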