CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems

Haibin Wu, Yuan Tseng, Hung-yi Lee

TL;DR

The paper addresses the impersonation risk posed by codec-based speech synthesis, which can clone an unseen speaker's voice from a few seconds of audio. It introduces CodecFake, a dataset of 15 codec-model subsets spanning 6 frameworks, built by encoding real speech into discrete codes and decoding them with multiple codec decoders. Experiments show that baselines trained on ASVspoof fail on codec-based deepfakes, whereas training on CodecFake yields strong detection, reaching an equal error rate (EER) of $0.4\%$ on VALL-E and related demos. The work provides a practical benchmark and dataset for building anti-spoofing models that are robust to modern codec-based generation, with code and data to be released to facilitate further research.

Abstract

Current state-of-the-art (SOTA) codec-based audio synthesis systems can mimic anyone's voice with just a 3-second sample from that specific unseen speaker. Unfortunately, malicious attackers may exploit these technologies, causing misuse and security issues. Anti-spoofing models have been developed to detect fake speech. However, the open question of whether current SOTA anti-spoofing models can effectively counter deepfake audios from codec-based speech synthesis systems remains unanswered. In this paper, we curate an extensive collection of contemporary SOTA codec models, employing them to re-create synthesized speech. This endeavor leads to the creation of CodecFake, the first codec-based deepfake audio dataset. Additionally, we verify that anti-spoofing models trained on commonly used datasets cannot detect synthesized speech from current codec-based speech generation systems. The proposed CodecFake dataset empowers these models to counter this challenge effectively.

Paper Structure

This paper contains 12 sections, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Codec-based speech generation pipeline.
  • Figure 2: Cross-testing results of anti-spoofing models on CodecFake test subsets. We report equal error rate (EER), where lower is better. Row AASIST and Row AASIST-L are baselines trained on the ASVspoof 2019 logical access training set using different architectures, while rows A-F6 are AASIST-L models trained on one training subset of CodecFake. Columns A-F6 indicate the specific CodecFake test subset evaluated.
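The equal error rate reported in Figure 2 is the operating point at which the false acceptance rate (spoofed audio accepted as real) equals the false rejection rate (real audio rejected as fake). The following is a minimal sketch of how an EER can be computed from detector scores, not the authors' evaluation code. The function name `compute_eer`, the label convention (1 = bona fide, 0 = spoof), and the assumption that a higher score means "more likely bona fide" are all illustrative choices that do not come from the paper.

```python
def compute_eer(scores, labels):
    """Return (eer, threshold) for a set of detection trials.

    Assumptions (illustrative, not from the paper):
      - labels: 1 = bona fide speech, 0 = spoofed speech
      - a higher score means the detector believes the trial is bona fide
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoof trial")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, best_eer, best_thr = None, None, None
    for thr in candidates:
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= thr for s in spoof) / len(spoof)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < thr for s in bona) / len(bona)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # With finitely many trials the two rates rarely cross exactly,
            # so report their mean at the closest crossing point.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    # Hypothetical scores for four trials: two bona fide, two spoofed.
    demo_scores = [0.9, 0.4, 0.6, 0.1]
    demo_labels = [1, 1, 0, 0]
    eer, thr = compute_eer(demo_scores, demo_labels)
    print(f"EER = {eer:.1%} at threshold {thr}")
```

A cross-testing grid like the one in Figure 2 would apply such a function once per (training subset, test subset) pair, filling one cell per pair. Lower values indicate better detection.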