The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Yuankun Xie; Yi Lu; Ruibo Fu; Zhengqi Wen; Zhiyong Wang; Jianhua Tao; Xin Qi; Xiaopeng Wang; Yukun Liu; Haonan Cheng; Long Ye; Yi Sun

TL;DR

The paper tackles the challenge of detecting deepfake audio generated by Audio Language Models (ALMs), which typically synthesize waveforms with neural codecs rather than traditional vocoders. It introduces Codecfake, a large-scale English and Chinese dataset of over 1 million audio samples, with fake audio generated by seven neural codecs, and a generalized training strategy, CSAM, to improve domain-general detection. Experiments show that detectors trained only on vocoder-based data fail on codec-based ALM audio, while training on codec-based data and CSAM-based co-training yield substantially lower equal error rates (EER), including an average EER of 0.616% across all test conditions. The work demonstrates the importance of codec-aware data and training strategies for universal deepfake audio detection, and the dataset and code are available online for reproducibility and further research.

Abstract

With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the core mechanism of ALM-based audio generation: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source, large-scale collection comprising over 1 million audio samples in both English and Chinese, focused on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and to tackle the domain ascent bias issue of the original sharpness-aware minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that ADD models trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average equal error rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.
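Since the headline result is reported as an equal error rate, a minimal sketch of how an EER can be computed from detector scores may help. This is not the authors' evaluation code. The function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the simple threshold sweep are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher scores mean bona fide.

    Hypothetical helper for illustration; the score convention is an assumption.
    """
    bona = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score that was observed.
    thresholds = np.unique(np.concatenate([bona, spoof]))

    # False acceptance rate: spoofed audio accepted as bona fide (score >= t).
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection rate: bona fide audio rejected as spoofed (score < t).
    frr = np.array([(bona < t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores only; they are not taken from the paper.
    eer, thr = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
    print(f"EER = {eer * 100:.1f}% at threshold {thr}")
```

With these toy scores, one bona fide sample and one spoofed sample fall on the wrong side of the chosen threshold, so both error rates are 1 in 4. An average EER like the one quoted in the abstract would then be the mean of such per-condition values, assuming each test condition is scored separately.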

Paper Structure (26 sections, 6 equations, 6 figures, 10 tables)

Introduction
Related Work
Vocoder-based deepfake audio
Codec-based deepfake audio
Audio Deepfake Detection Dataset
Dataset Design
Overview
Architectures for generating codec-based fake audio
The Generation Process of codec-based fake audio
Overall Statistics
Testing Condition
Audio Deepfake Detection
Baseline model
Countermeasure for generalized ADD method
Experiments
...and 11 more sections

Figures (6)

Figure 1: ALM-based deepfake audio. The inner circle represents different audio types, while the outer circle represents different ALM-based generation methods, with different colors indicating the use of different codec methods.
Figure 2: Mel-spectrogram of the original audio alongside 7 codec-based audio samples generated from the original. The top row is generated from VCTK, and the bottom row is generated from AISHELL3.
Figure 3: Partitions and construction of the Codecfake dataset. The left part displays the training set, development set, and evaluation set of the Codecfake dataset. The right part illustrates the diverse testing conditions of Codecfake.
Figure 4: Four baseline ADD models for evaluation. (a), (b), (c), and (d) represent Mel-LCNN, W2V2-LCNN, WavLM-AASIST, and W2V2-AASIST, respectively.
Figure 5: The confusion matrices under different test conditions. (a), (b) correspond to W2V2-AASIST trained on the 19LA training set and tested on 19LA and C7. (c), (d) correspond to W2V2-AASIST trained on the Codecfake training set and tested on 19LA and C7.
...and 1 more figure
