Source Tracing of Audio Deepfake Systems

Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury

TL;DR

This work introduces a system designed to classify various spoofing attributes, capturing the distinctive features of individual modules throughout the entire generation pipeline of audio deepfakes.

Abstract

Recent progress in generative AI technology has made audio deepfakes remarkably more realistic. While current research on anti-spoofing systems primarily focuses on determining whether a given audio sample is fake or genuine, limited attention has been paid to identifying the specific techniques used to create the deepfake. Algorithms commonly used in audio deepfake generation, such as text-to-speech (TTS) and voice conversion (VC), proceed through distinct stages, including input processing, acoustic modeling, and waveform generation. In this work, we introduce a system designed to classify various spoofing attributes, capturing the distinctive features of individual modules throughout the entire generation pipeline. We evaluate our system on two datasets: ASVspoof 2019 Logical Access and the Multi-Language Audio Anti-Spoofing Dataset (MLAAD). Results from both experiments demonstrate the robustness of the system in identifying the different spoofing attributes of deepfake generation systems.
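To make the idea of per-stage spoofing attributes concrete, here is a minimal sketch of how each generation system might be described by one label per pipeline stage, with a separate class vocabulary per attribute. The class `SpoofSystem`, the helper names, and all attribute values are hypothetical illustrations, not the label sets or code used in the paper.

```python
from dataclasses import dataclass

# The three stages follow the abstract; the values below are invented examples.
ATTRIBUTES = ("input_processing", "acoustic_model", "waveform_generation")


@dataclass(frozen=True)
class SpoofSystem:
    """One deepfake generation system, described stage by stage (hypothetical)."""
    input_processing: str      # e.g. "text" for TTS, "speech" for VC
    acoustic_model: str
    waveform_generation: str


def build_label_maps(systems):
    """Map each attribute's distinct values to integer class ids."""
    maps = {}
    for attr in ATTRIBUTES:
        values = sorted({getattr(s, attr) for s in systems})
        maps[attr] = {value: idx for idx, value in enumerate(values)}
    return maps


def encode(system, maps):
    """Return one class id per attribute for a single system."""
    return {attr: maps[attr][getattr(system, attr)] for attr in ATTRIBUTES}


systems = [
    SpoofSystem("text", "tacotron2", "vocoder_a"),
    SpoofSystem("text", "tacotron2", "vocoder_b"),
    SpoofSystem("speech", "vc_model_a", "vocoder_b"),
]
label_maps = build_label_maps(systems)
targets = [encode(s, label_maps) for s in systems]
```

Under this framing, two systems that share an acoustic model but differ in waveform generator receive the same acoustic-model target and different waveform-generation targets, so a classifier can be trained and scored on each attribute separately.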

Paper Structure (13 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Illustration of proposed frameworks for spoofing attribute classification. Top: End-to-end learning from audio. Bottom: Two-stage learning that includes a traditional countermeasure (CM) and an auxiliary classifier trained on embeddings.
  • Figure 2: Confusion matrix for ResNet (E2E) acoustic model predictions on the MLAAD evaluation set. Prediction counts are normalized by true label counts (by row). T: Tacotron2
  • Figure 3: Embeddings from our top-performing models on the acoustic model classification task of each of our protocols, plotted using UMAP dimensionality reduction with $n\_neighbors=50$. Left: ASVspoof embeddings from SSL (E2E) model. Right: MLAAD embeddings from ResNet (E2E) model.
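
The row normalization described in the Figure 2 caption (prediction counts divided by true label counts) can be sketched as follows. This is an illustrative standard-library version with made-up labels and predictions; the function name `row_normalized_confusion` is hypothetical and this is not the authors' evaluation code.

```python
def row_normalized_confusion(true_labels, pred_labels, classes):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        counts[index[t]][index[p]] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row for classes that never occur as a true label.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy data: two invented acoustic-model classes "A" and "B".
classes = ["A", "B"]
true_labels = ["A", "A", "A", "A", "B", "B"]
pred_labels = ["A", "A", "A", "B", "B", "A"]
matrix = row_normalized_confusion(true_labels, pred_labels, classes)
```

With this normalization, each diagonal entry is the per-class recall, which keeps classes with few evaluation samples visually comparable to frequent ones.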