AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds

Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie

TL;DR

AUDETER tackles the open-world deepfake audio detection problem by providing a large-scale, highly diverse dataset that pairs real speech with synthetic variants from 21 synthesis systems. The authors demonstrate that existing detectors struggle to generalize to unseen synthesis systems and to shifts in real speech, and they show that training on AUDETER improves cross-domain performance, with XLR-based detectors achieving strong results (e.g., an EER of $1.87\%$ on In-the-Wild). They identify a limitation of binary classification when training with diverse deepfake patterns and propose a curriculum-learning framework that learns system-invariant representations and uses teacher–student distillation to incorporate diverse patterns. The work highlights the value of data-centric approaches for open-world detection and positions AUDETER as a resource to drive future improvements in robust, generalizable deepfake audio detectors.

Abstract

Speech synthesis systems can now produce highly realistic vocalisations that pose significant authenticity challenges. Despite substantial progress in deepfake detection models, their real-world effectiveness is often undermined by evolving distribution shifts between training and test data, driven by the complexity of human speech and the rapid evolution of synthesis systems. Existing datasets suffer from limited real speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous mixtures of deepfake sources, which hinder systematic evaluation and open-world model training. To address these issues, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale and highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders, totalling 3 million clips. We further observe that most existing detectors default to binary supervised training, which can induce negative transfer across synthesis sources when the training data contains highly diverse deepfake patterns, impacting overall generalisation. As a complementary contribution, we propose an effective curriculum-learning-based approach to mitigate this effect. Extensive experiments show that existing detection models struggle to generalise to novel deepfakes and human speech in AUDETER, whereas XLR-based detectors trained on AUDETER achieve strong cross-domain performance across multiple benchmarks, achieving an EER of 1.87% on In-the-Wild. AUDETER is available on GitHub.
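The headline metric above, the equal error rate (EER), is the operating point at which the rate of deepfakes accepted as real equals the rate of real speech rejected as fake. The following is a minimal sketch of how an EER can be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely real", and the label encoding (1 = real, 0 = deepfake) are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Illustrative EER; assumes higher score = more likely real, labels 1 = real, 0 = fake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    real = scores[labels == 1]
    fake = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A clip is accepted as real when its score is >= the threshold.
    far = np.array([(fake >= t).mean() for t in thresholds])  # fakes accepted as real
    frr = np.array([(real < t).mean() for t in thresholds])   # real clips rejected

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy scores (made up for illustration, not data from the paper).
eer_separable = equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
eer_overlap = equal_error_rate([0.9, 0.4, 0.1, 0.6], [1, 1, 0, 0])
print(eer_separable, eer_overlap)
```

A lower EER is better: a detector that separates real and synthetic clips perfectly reaches 0, while heavily overlapping score distributions push the value towards 0.5. Published benchmarks usually interpolate between thresholds, so values from this sketch can differ slightly from those of a reference implementation.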

Paper Structure

This paper contains 61 sections, 2 equations, 8 figures, 38 tables, 1 algorithm.

Figures (8)

  • Figure 1: UMAP visualisation of last-layer audio representations from an XLR+R+A model trained on ASVSpoof and evaluated on AUDETER. Colours indicate normalised real-class likelihood. The strong overlap between real samples and five deepfake types, including many real samples in high-spoof regions, indicates limited generalisation, with both false positives and false negatives.
  • Figure 2: UMAP visualisation of real and synthetic audio samples. AUDETER captures more diverse real speech patterns and a broader range of synthetic variations compared to existing datasets.
  • Figure 3: WER similarity (a) and MOS score (b) for the US Congress Subset.
  • Figure 4: Open-world detection performance of the baseline methods on CrowdSource (CS) and US Congress (UC) subsets.
  • Figure 5: Average performance of top 3 EER XLR+R+A models trained with varying data proportions.
  • ...and 3 more figures