Discrete optimal transport is a strong audio adversarial attack

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan

TL;DR

Comparison with $k$NN-VC, SinkVC, and Gaussian optimal transport (MKL) demonstrates stronger domain adaptation abilities of the discrete optimal transport voice conversion method, and ablation analysis indicates that distribution-level alignment is a powerful and stable attack for deployed CMs.

Abstract

In this paper, we introduce the discrete optimal transport voice conversion ($k$DOT-VC) method. Comparison with $k$NN-VC, SinkVC, and Gaussian optimal transport (MKL) demonstrates stronger domain adaptation abilities of our method. We use the probabilistic nature of optimal transport (OT) and show that $k$DOT-VC is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated speech are aligned to an unpaired bona fide pool via entropic OT and a top-$k$ barycentric projection, then decoded with a neural vocoder. Ablation analysis indicates that distribution-level alignment is a powerful and stable attack for deployed CMs.
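
The alignment step described in the abstract (entropic OT followed by a top-$k$ barycentric projection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine cost, the uniform marginals, the plain Sinkhorn iterations, the default values of `eps` and `k`, and all function names are assumptions made for illustration. WavLM feature extraction and vocoder decoding are omitted, so the inputs are simply arrays of frame embeddings.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and y (m, d).

    Assumption: the abstract does not state the cost; cosine distance is
    used here only as a plausible choice for embedding frames.
    """
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    Assumption: uniform weights on both frame sets and a fixed number of
    iterations; eps is a hypothetical regularization strength.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # source (generated speech) marginal
    b = np.full(m, 1.0 / m)  # target (bona fide pool) marginal
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def topk_barycentric_projection(plan, target, k=4):
    """Map each source frame to the weighted mean of its k heaviest targets.

    For every row of the plan, keep the k target frames that receive the
    most mass, renormalize those weights, and average the target frames.
    """
    idx = np.argpartition(-plan, k - 1, axis=1)[:, :k]   # (n, k) indices
    w = np.take_along_axis(plan, idx, axis=1)            # (n, k) plan mass
    w = w / w.sum(axis=1, keepdims=True)                 # renormalize per row
    return np.einsum("nk,nkd->nd", w, target[idx])       # (n, d) embeddings


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    generated = rng.normal(size=(50, 16))   # stand-in for generated-speech frames
    bona_fide = rng.normal(size=(200, 16))  # stand-in for the bona fide pool
    plan = sinkhorn_plan(cosine_cost(generated, bona_fide))
    converted = topk_barycentric_projection(plan, bona_fide, k=4)
    print(converted.shape)
```

In this sketch each converted frame is a convex combination of at most `k` bona fide frames, and with `k=1` the projection reduces to picking the single target frame with the most transported mass.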

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Schematic overview of the DOT-based voice conversion attack pipeline.
  • Figure 2: Bona fide embeddings from LibriSpeech.
  • Figure 3: Bona fide embeddings from ASVspoof2019.