Where are we in audio deepfake detection? A systematic analysis over generative and detection models

Xiang Li; Pin-Yu Chen; Wenqi Wei

TL;DR

The paper presents SONAR, a unified framework and benchmark for evaluating AI-synthesized audio detectors across traditional and foundation-model architectures using a diverse dataset from nine synthesis platforms. It demonstrates that speech foundation models generalize better across datasets and languages than traditional detectors, with model size and pretraining data playing key roles. Few-shot fine-tuning can markedly improve generalization for targeted scenarios but risks catastrophic forgetting if overapplied. The work underscores the need for benchmarks that track advances in TTS and VC technologies to develop robust detectors, and it informs practical deployment through insights on cross-lingual performance and model scaling.

Abstract

Recent advances in Text-to-Speech (TTS) and Voice-Conversion (VC) using generative Artificial Intelligence (AI) technology have made it possible to generate high-quality and realistic human-like audio. This poses growing challenges in distinguishing AI-synthesized speech from the genuine human voice and raises concerns about misuse for impersonation, fraud, spreading misinformation, and scams. However, existing detection methods for AI-synthesized audio have not kept pace and often fail to generalize across diverse datasets. In this paper, we introduce SONAR, a synthetic AI-Audio Detection Framework and Benchmark, aiming to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. SONAR includes a novel evaluation dataset sourced from 9 diverse audio synthesis platforms, including leading TTS providers and state-of-the-art TTS models. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems. Through extensive experiments, (1) we reveal the limitations of existing detection methods and demonstrate that foundation models exhibit stronger generalization capabilities, likely due to their model size and the scale and quality of pretraining data. (2) Speech foundation models demonstrate robust cross-lingual generalization capabilities, maintaining strong performance across diverse languages despite being fine-tuned solely on English speech data. This finding also suggests that the primary challenges in audio deepfake detection are more closely tied to the realism and quality of synthetic audio than to language-specific characteristics. (3) We explore the effectiveness and efficiency of few-shot fine-tuning in improving generalization, highlighting its potential for tailored applications, such as personalized detection systems for specific entities or individuals.
