CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset

Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee

TL;DR

CodecFake+ tackles the rising threat of codec-based deepfake speech by delivering what is, to the authors' knowledge, the largest public dataset for CodecFake detection, together with a structured taxonomy of neural audio codecs. It enables analysis at three levels (codec, taxonomy, and database) by pairing codec re-synthesized speech (CoRS) from 31 codec models with codec-based speech generation (CoSG) samples from 17 unseen models, and by measuring detection performance with the equal error rate (EER). The study shows that training detectors on CoRS data improves generalization to unseen CoSG outputs, most of all when the re-synthesis codec uses a frequency-domain decoder or disentanglement auxiliary objectives, and that taxonomy-guided data selection further improves robustness. By providing extensive data, a taxonomy, and training strategies, CodecFake+ offers a practical resource for developing and evaluating anti-spoofing models against codec-based deepfakes, with code and data to be released to the research community.

Abstract

With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.
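The detection results summarized above are reported as equal error rate (EER), the operating point at which the false acceptance rate (spoofed speech accepted as bona fide) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `compute_eer`, the label convention (1 = bona fide, 0 = spoof), and the toy scores are all assumptions made for illustration.

```python
def compute_eer(scores, labels):
    """Approximate the equal error rate by sweeping a decision threshold.

    Assumed convention (not from the paper): a higher score means
    "more likely bona fide"; labels use 1 = bona fide, 0 = spoof.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoofed trial")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    thresholds = sorted(set(scores))
    thresholds.append(thresholds[-1] + 1.0)

    best = None
    for t in thresholds:
        # A trial is accepted as bona fide when its score is >= t.
        frr = sum(s < t for s in bona) / len(bona)      # bona fide rejected
        far = sum(s >= t for s in spoof) / len(spoof)   # spoof accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy, made-up detector scores: two bona fide and two spoofed utterances.
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_labels = [1, 1, 0, 0]
eer, threshold = compute_eer(toy_scores, toy_labels)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With few trials the FAR and FRR curves may not cross exactly, so the sketch averages the two rates at the closest point; evaluation toolkits typically interpolate along the ROC curve instead.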

Paper Structure (37 sections, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Illustration of how models trained on different datasets handle CodecFake (i.e., speech generated by a codec-based speech generation model). (a) Traditional dataset: Trained on conventional TTS/VC data, the model struggles with CodecFake detection. (b) Previous CodecFake dataset (wu24p_interspeech): A small-scale codec re-synthesized speech dataset for detecting EnCodec-based CodecFake. (c) CodecFake+: A large-scale dataset that enables better CodecFake detection and supports multi-level analysis.
  • Figure 2: Timeline of the different types of current neural audio codec models and codec-based speech generation models.
  • Figure 3: Comparison and relationship between neural audio codec and codec-based speech generation, along with their corresponding resynthesized CoRS and generated CoSG sets.
  • Figure 4: CodecFake+ Overview in Section \ref{tab:sec_CodecFake-Omni}.
  • Figure 5: UTMOS scores of the CodecFake+ dataset.
  • ...and 2 more figures