The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

TL;DR

The paper tackles the challenge of detecting ALM-based deepfake audio, which is often generated with neural codecs rather than traditional vocoders. It introduces Codecfake, a large-scale, multilingual dataset with over 1 million codec-based fake samples across seven neural codecs, and a generalized training strategy, CSAM, to improve domain-general detection. Experiments show that detectors trained only on vocoder-based data fail on codec-based ALM audio, while training on codec data and CSAM-based co-training yield substantially lower equal error rates, including a 0.616% average EER across diverse test conditions. The work demonstrates the importance of codec-aware data and training strategies for universal deepfake audio detection and provides online access to the dataset and code for reproducibility and further research.

Abstract

With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source, large-scale collection comprising over 1 million audio samples in both English and Chinese, focused on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original sharpness-aware minimization (SAM), we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that an ADD model trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average equal error rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.
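
The abstract reports results as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER could be computed from detector scores. It is not the authors' evaluation code. The function name `compute_eer` and the convention that a higher score means "more likely bona fide" are assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER from two score lists.

    Assumption (not from the paper): a higher score means the detector
    considers the sample more likely to be bona fide.
    Returns (eer, threshold), with eer as a fraction in [0, 1].
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, ascending.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide samples scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoof samples scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their midpoint as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores for illustration only; these are not data from the paper.
    eer, thr = compute_eer([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])
    print(f"EER = {eer * 100:.2f}% at threshold {thr}")
```

An average EER across test conditions, as quoted in the abstract, would then be the mean of the per-condition values returned by such a function.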

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: ALM-based deepfake audio. The inner circle represents different audio types, while the outer circle represents different ALM-based generation methods, with different colors indicating the use of different codec methods.
  • Figure 2: Mel-spectrogram of the original audio alongside 7 codec-based audio samples generated from the original. The top row is generated from VCTK, and the bottom row is generated from AISHELL3.
  • Figure 3: Partitions and construction of the Codecfake dataset. The left part displays the training set, development set, and evaluation set of the Codecfake dataset. The right part illustrates the diverse testing conditions of Codecfake.
  • Figure 4: Four baseline ADD models for evaluation. (a), (b), (c), and (d) represent Mel-LCNN, W2V2-LCNN, WavLM-AASIST, and W2V2-AASIST, respectively.
  • Figure 5: The confusion matrices under different test conditions. (a), (b) correspond to W2V2-AASIST trained on the 19LA training set and tested on 19LA and C7. (c), (d) correspond to W2V2-AASIST trained on the Codecfake training set and tested on 19LA and C7.
  • ...and 1 more figures