What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

Shree Harsha Bokkahalli Satish, Harm Lameris, Joakim Gustafson, Éva Székely

Abstract

Audio anti-spoofing systems are typically formulated as binary classifiers distinguishing bona fide from spoofed speech. This assumption fails under layered generative processing, where benign transformations introduce distributional shifts that are misclassified as spoofing. We show that phonation-modifying voice conversion and speech restoration are treated as out-of-distribution despite preserving speaker authenticity. Using a multi-class setup separating bona fide, converted, spoofed, and converted-spoofed speech, we analyse model behaviour through self-supervised learning (SSL) embeddings and acoustic correlates. The benign transformations induce a drift in the SSL space, compressing bona fide and spoofed speech and reducing classifier separability. Reformulating anti-spoofing as a multi-class problem improves robustness to benign shifts while preserving spoof detection, suggesting binary systems model the distribution of raw speech rather than authenticity itself.
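The multi-class reformulation can be illustrated with a small sketch. It assumes a classifier that emits one logit per class in the order shown; the class indices, the threshold, and the rule of summing the two spoof-class probabilities to recover a binary decision are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical index layout: the abstract names the four classes
# but does not specify how they are ordered or encoded.
CLASSES = ["bona_fide", "converted", "spoofed", "converted_spoofed"]
SPOOF_IDX = [2, 3]  # classes whose source speech is spoofed (assumed mapping)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def spoof_probability(logits):
    """Collapse four-class posteriors into a single spoof probability."""
    probs = softmax(np.asarray(logits, dtype=float))
    return probs[..., SPOOF_IDX].sum(axis=-1)


def is_spoof(logits, threshold=0.5):
    """Binary decision; the 0.5 threshold is an arbitrary illustrative choice."""
    return spoof_probability(logits) >= threshold
```

Under this reading, an utterance that the model confidently assigns to the converted class is still accepted as authentic, because benign conversion has a class of its own rather than being absorbed into the spoof class.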

Paper Structure

This paper contains 12 sections, 2 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: t-SNE plots of Wav2Vec2 embeddings before and after voice quality conversion (VQC) on the MLAAD matched dataset.
  • Figure 2: t-SNE plots of Wav2Vec2 embeddings before and after Sidon enhancement on the MLAAD matched dataset.
  • Figure 3: Directional consistency of VQC-induced embedding shifts between mean bona fide and spoofed shift vectors. Whisper shows lower or negative values, suggesting potential source-dependent shift directions.
  • Figure 4: The acoustic feature shifts between the original M-AILABS/MLAAD recordings and the converted recordings by voice quality.
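
The directional consistency mentioned in the Figure 3 caption can be sketched as follows. The caption does not state the metric, so this example assumes it is the cosine similarity between the mean shift vector of bona fide embeddings and that of spoofed embeddings; the function names and the paired-row input layout are hypothetical.

```python
import numpy as np


def mean_shift(before, after):
    """Mean embedding displacement; row i of both arrays is the same utterance."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    return (after - before).mean(axis=0)


def directional_consistency(bf_before, bf_after, sp_before, sp_after):
    """Cosine similarity between bona fide and spoofed mean shift vectors.

    Assumed metric: the caption only says 'directional consistency'.
    """
    u = mean_shift(bf_before, bf_after)
    v = mean_shift(sp_before, sp_after)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return float("nan")  # direction is undefined when a mean shift is zero
    return float(u @ v / denom)
```

Under this assumption, a value near 1 would mean the transformation moves bona fide and spoofed embeddings in the same direction, while a value near 0 or below would indicate that the shift direction depends on the source class.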