The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio
Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun
TL;DR
The paper addresses the detection of deepfake audio generated by audio language models (ALMs), which synthesize waveforms with neural codecs rather than traditional vocoders. It introduces Codecfake, a large-scale dataset of over 1 million audio samples in English and Chinese covering seven neural codecs, and CSAM, a training strategy aimed at domain-general detection. Experiments show that detectors trained only on vocoder-based data fail on codec-based ALM audio, whereas training on codec-based data and co-training with CSAM yield substantially lower equal error rates (EERs), with CSAM reaching the lowest average EER of 0.616% across all test conditions. The work shows that codec-aware data and training strategies matter for universal deepfake audio detection, and the dataset and code are available online for reproducibility and further research.
Abstract
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source, large-scale collection comprising over 1 million audio samples in both English and Chinese, focused on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and to tackle the domain ascent bias issue of the original sharpness-aware minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that ADD models trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average equal error rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.
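
Because the results above are reported as EER, a minimal sketch of how that metric can be computed from detector scores may be useful. It is an illustration only, not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for this example.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Return the EER as a fraction in [0, 1].

    Assumed convention (not taken from the paper): a higher score
    means the detector considers the utterance more likely bona fide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide audio scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed audio scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is read off where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


# Toy scores (hypothetical): one bona fide sample scores below one spoof sample.
bonafide_demo = [0.9, 0.6, 0.4, 0.8]
spoof_demo = [0.1, 0.5, 0.3, 0.2]
eer_demo = equal_error_rate(bonafide_demo, spoof_demo)
print(f"EER = {100 * eer_demo:.1f}%")
```

The function returns a fraction, so an average EER of 0.616% corresponds to a returned value of about 0.00616. With few samples the two error curves may not cross exactly, which is why the sketch averages them at the closest threshold; evaluation toolkits often interpolate instead.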
