How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

Yixuan Xiao, Florian Lux, Alejandro Pérez-González-de-Martos, Ngoc Thang Vu

TL;DR

This work addresses how to label audio resynthesized by neural audio codecs (NACs), referred to as CoRS, in audio deepfake detection. The question arises because codecs are dual-use: they support both transmission and synthesis (CoSG). The paper introduces CodecDeepfakeDetection, an open dataset based on ASVspoof 5 that combines multiple TTS systems and neural audio codecs to study labeling strategies. Experiments with different labeling schemes and augmentation strategies show that the best labeling approach depends on the codec’s design objective: for compression-oriented NACs, treating CoRS as spoof works better for synthesis-oriented detection, while for other NACs, CoRS-as-bonafide augmentation helps the detector learn codec-invariant cues. The findings highlight the need to disentangle codec artifacts from spoof cues and point to NAC design as a key factor shaping both detection performance and labeling policy, with practical implications for robust audio deepfake defenses.

Abstract

Since Text-to-Speech systems typically don't produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.
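
The labeling question can be made concrete with a minimal sketch. It is not the authors' code: the names `Source`, `Policy`, `assign_label`, and `build_training_labels` are hypothetical, and the sketch assumes only the two options named above, namely that codec-resynthesized human speech (CoRS) is labeled either spoof or bonafide, while untouched human speech is always bonafide and codec-based generated speech (CoSG) is always spoof.

```python
from enum import Enum


class Source(Enum):
    """Where an utterance came from (hypothetical categories for illustration)."""
    HUMAN = "human"  # original human recording
    CORS = "cors"    # human speech passed through a codec encode/decode cycle
    COSG = "cosg"    # speech generated by a codec-based synthesis system


class Policy(Enum):
    """How codec-resynthesized human speech is labeled for training."""
    CORS_AS_SPOOF = "cors_as_spoof"
    CORS_AS_BONAFIDE = "cors_as_bonafide"


def assign_label(source: Source, policy: Policy) -> str:
    """Return 'bonafide' or 'spoof' for one utterance under a labeling policy."""
    if source is Source.HUMAN:
        return "bonafide"
    if source is Source.COSG:
        return "spoof"
    # Only CoRS is ambiguous: its label depends on the chosen policy.
    return "spoof" if policy is Policy.CORS_AS_SPOOF else "bonafide"


def build_training_labels(utterances, policy):
    """Map (utterance_id, Source) pairs to (utterance_id, label) pairs."""
    return [(utt_id, assign_label(src, policy)) for utt_id, src in utterances]


if __name__ == "__main__":
    data = [("u1", Source.HUMAN), ("u2", Source.CORS), ("u3", Source.COSG)]
    for policy in Policy:
        print(policy.value, build_training_labels(data, policy))
```

Only the label of the CoRS utterance changes between the two policies, which is what makes the choice a training-time design decision rather than a property of the audio itself.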

Paper Structure

This paper contains 14 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Overview of the setting we investigate: Modern attackers utilize NACs to obtain discrete tokens for language modeling. Human speech is also encoded with NACs for storage or transmission. Real and fake speech sometimes share the same NAC decoders. Can we still detect the fakes?