Source Tracing of Audio Deepfake Systems

Nicholas Klein; Tianxiang Chen; Hemlata Tak; Ricardo Casal; Elie Khoury

TL;DR

This work introduces a system designed to classify various spoofing attributes, capturing the distinctive features of individual modules throughout the entire generation pipeline of audio deepfakes.

Abstract

Recent progress in generative AI technology has made audio deepfakes remarkably more realistic. While current research on anti-spoofing systems primarily focuses on assessing whether a given audio sample is fake or genuine, limited attention has been paid to discerning the specific techniques used to create audio deepfakes. Algorithms commonly used in audio deepfake generation, like text-to-speech (TTS) and voice conversion (VC), undergo distinct stages including input processing, acoustic modeling, and waveform generation. In this work, we introduce a system designed to classify various spoofing attributes, capturing the distinctive features of individual modules throughout the entire generation pipeline. We evaluate our system on two datasets: the ASVspoof 2019 Logical Access dataset and the Multi-Language Audio Anti-Spoofing Dataset (MLAAD). Results from both experiments demonstrate that the system robustly identifies the different spoofing attributes of deepfake generation systems.

Paper Structure (13 sections, 3 figures, 5 tables)

This paper contains 13 sections, 3 figures, 5 tables.

Introduction
Attribute classification of spoof systems
Proposed strategies
Countermeasures
Datasets and protocols
ASVspoof 2019
MLAAD
Experimental Results
Implementation details
Results on ASVspoof 2019
Results on MLAAD
Embedding visualization
Conclusions and Discussions

Figures (3)

Figure 1: Illustration of the proposed frameworks for spoofing attribute classification. Top: End-to-end learning from audio. Bottom: Two-stage learning that includes a traditional countermeasure (CM) and an auxiliary classifier trained on embeddings.
Figure 2: Confusion matrix for ResNet (E2E) acoustic model predictions on the MLAAD evaluation set. Prediction counts are normalized by true label counts (by row). T: Tacotron2
Figure 3: Embeddings from our top-performing models on the acoustic model classification task of each of our protocols, plotted using UMAP dimensionality reduction with $n\_neighbors=50$. Left: ASVspoof embeddings from SSL (E2E) model. Right: MLAAD embeddings from ResNet (E2E) model.
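
The Figure 2 caption says prediction counts are normalized by true label counts, that is, by row. A minimal sketch of that normalization follows, assuming plain Python lists of labels. The function name `row_normalized_confusion`, the class names, and the label data are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter


def row_normalized_confusion(true_labels, predicted_labels, classes):
    """Return a matrix whose entry [i][j] is the fraction of samples with
    true class classes[i] that were predicted as classes[j].

    Predictions outside `classes` are ignored in this sketch.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = []
    for true_cls in classes:
        # Total number of samples whose true label is this class.
        row_total = sum(counts[(true_cls, pred_cls)] for pred_cls in classes)
        if row_total == 0:
            # No samples of this class: leave the row at zero.
            matrix.append([0.0 for _ in classes])
        else:
            matrix.append(
                [counts[(true_cls, pred_cls)] / row_total for pred_cls in classes]
            )
    return matrix


# Hypothetical labels for illustration only; not data from the paper.
classes = ["model_a", "model_b"]
y_true = ["model_a", "model_a", "model_a", "model_a", "model_b", "model_b"]
y_pred = ["model_a", "model_a", "model_a", "model_b", "model_b", "model_a"]

cm = row_normalized_confusion(y_true, y_pred, classes)
for cls, row in zip(classes, cm):
    print(cls, row)
```

With row normalization, each row with at least one sample sums to 1, so the diagonal entry can be read as the per-class recall regardless of how many samples each class has.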
