On Deepfake Voice Detection -- It's All in the Presentation

Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro, Haydar Talib

TL;DR

This work argues that current deepfake voice-detection research suffers from a lack of realism, limiting real-world applicability. It introduces a holistic data-generation framework that models the full spoof attack sequence, including base, presented, real-world, and augmented data, and demonstrates that realism in data substantially improves generalization, often more than increasing model size does. Through extensive evaluation of three spoof-detection frontends across varied training regimes, the authors show that incorporating presentation and real-world conditions yields large accuracy gains (up to 57% on a real-world benchmark) and that lightweight, data-rich approaches can rival larger models. The study emphasizes data collection and realistic benchmarks as the key lever for practical deepfake defenses, with broad implications for dataset design and evaluation in the audio-forensics domain.

Abstract

While the technologies empowering malicious audio deepfakes have dramatically evolved in recent years due to generative AI advances, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real-world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g. by phone. We propose a new framework for data creation and research methodology, allowing for the development of spoofing countermeasures that would be more effective in real-world scenarios. By following the guidelines outlined here, we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate that improving datasets would have a bigger impact on deepfake detection accuracy than choosing larger SOTA models over smaller ones; that is, it would be more important for the scientific community to invest in comprehensive data collection programs than to simply train larger models with higher computational demands.

Paper Structure

This paper contains 8 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The sequence of steps transforming the voice signal is more elaborate in a real-world application (a telephone banking call center in this example) than simply using the source deepfake audio file created by a TTS tool (a). Instead, the fraudster presents the deepfake voice through the phone using a loudspeaker or a direct injection technique. This audio is then transmitted through the telephony network, completing the presentation phase (b). In the task phase, we arrive at the real-world setting: a phone call takes place to engage the bank's call center agent (c). The feedback arrow indicates that the fraudster uses the deepfake tool with the end task in mind, e.g. by creating conversational phrases. Each phase of this sequence introduces one or more distortions to the source deepfake audio signal.
  • Figure 2: (a) The ResNet-CoT system for spoof detection and (b) the two variants of Res-CoT blocks used by the model (encoded by dark and light blue colors in (a)).
  • Figure 3: Plot of accuracy in terms of MDR at FAR=1% for the deepfake detection systems logmel-ResNet-CoT, WavLM-LLGF and WavLM-Nes2Net over the training conditions Base, Base+Augmented, Base+Presented, and Base+Presented+Augmented (color-coded), on the testing conditions Base, Realworld/injection, and Realworld/playback.
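
The operating point named in the Figure 3 caption (MDR at FAR=1%) can be illustrated with a short sketch. This page does not expand the acronyms, so the code assumes that MDR is the missed-detection rate (deepfakes that are not flagged) and FAR is the false-alarm rate (genuine speech flagged as deepfake), and that a higher score means "more likely deepfake". The function name `mdr_at_far` and the score convention are hypothetical and are not taken from the paper.

```python
import numpy as np


def mdr_at_far(genuine_scores, spoof_scores, target_far=0.01):
    """Return (mdr, threshold, far) at the strictest threshold with FAR <= target_far.

    Assumed convention (not from the paper): higher score = more likely
    deepfake, and a trial is flagged when its score is strictly above
    the threshold.
    """
    if not 0.0 <= target_far < 1.0:
        raise ValueError("target_far must be in [0, 1)")

    genuine = np.sort(np.asarray(genuine_scores, dtype=float))
    spoof = np.asarray(spoof_scores, dtype=float)
    if genuine.size == 0 or spoof.size == 0:
        raise ValueError("both score lists must be non-empty")

    # Number of genuine trials we are allowed to flag by mistake.
    allowed_false_alarms = int(np.floor(target_far * genuine.size))

    # Use the (allowed + 1)-th largest genuine score as the threshold, so at
    # most `allowed_false_alarms` genuine scores lie strictly above it.
    threshold = genuine[genuine.size - 1 - allowed_false_alarms]

    far = np.mean(genuine > threshold)   # genuine trials wrongly flagged
    mdr = np.mean(spoof <= threshold)    # deepfake trials that were missed
    return float(mdr), float(threshold), float(far)


if __name__ == "__main__":
    # Synthetic scores for illustration only; these are not results from the paper.
    rng = np.random.default_rng(0)
    genuine = rng.normal(loc=0.0, scale=1.0, size=5000)
    spoof = rng.normal(loc=2.5, scale=1.0, size=5000)
    mdr, threshold, far = mdr_at_far(genuine, spoof, target_far=0.01)
    print(f"threshold={threshold:.3f}  FAR={far:.4f}  MDR={mdr:.4f}")
```

Fixing FAR and reporting MDR makes systems comparable at the same cost to genuine callers. Under these assumptions, a lower MDR means fewer deepfakes slip through at that fixed false-alarm budget.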