Audio Deepfake Attribution: An Initial Dataset and Investigation

Xinrui Yan; Jiangyan Yi; Jianhua Tao; Jie Chen

TL;DR

The Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA) enhances the global directional variation of representations, ensuring the learning of discriminative representations with strong intra-class similarity and inter-class discrepancy among known classes.

Abstract

The rapid progress of deep speech synthesis models poses significant threats to society, such as malicious manipulation of content. This has led to an increase in studies aimed at detecting so-called deepfake audio. However, existing work focuses on binary detection of real versus fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, binary classification alone is insufficient: it is also essential to identify the source of deepfake audio. Audio deepfake attribution has therefore emerged as a new challenge. To this end, we designed the first deepfake audio dataset for the attribution of audio generation tools, called Audio Deepfake Attribution (ADA), and conducted a comprehensive investigation of system fingerprints. To address the challenge of attributing the unknown audio generation tools that continuously emerge in the real world, we propose the Class-Representation Multi-Center Learning (CRML) method for open-set audio deepfake attribution (OSADA). CRML enhances the global directional variation of representations, ensuring the learning of discriminative representations with strong intra-class similarity and inter-class discrepancy among known classes. Finally, the strong class discrimination capability learned from known classes is extended to both known and unknown classes. Experimental results demonstrate that the CRML method effectively addresses open-set risks in real-world scenarios. The dataset is publicly available at https://zenodo.org/records/13318702 and https://zenodo.org/records/13340666.
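As a rough illustration of the open-set setting described in the abstract, the sketch below attributes an embedding to the nearest known generation tool by cosine similarity and rejects it as unknown when no known class is close enough. This is not the CRML method: it assumes a single mean center per class rather than multiple learned centers, and the function names, the labels `tts_a` and `tts_b`, and the threshold of 0.8 are all hypothetical choices made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def class_centers(embeddings, labels):
    """Average the unit-length embeddings of each known class.

    Assumption: one center per class, a simplification of the
    multi-center idea named in the abstract.
    """
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        unit = normalize(emb)
        if lab not in sums:
            sums[lab] = [0.0] * len(unit)
            counts[lab] = 0
        sums[lab] = [s + u for s, u in zip(sums[lab], unit)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def attribute(embedding, centers, threshold=0.8):
    """Return the most similar known class, or "unknown" below the threshold.

    The threshold value is an illustrative assumption, not a tuned setting.
    """
    best_label, best_sim = max(
        ((lab, cosine(embedding, center)) for lab, center in centers.items()),
        key=lambda pair: pair[1],
    )
    return best_label if best_sim >= threshold else "unknown"


# Toy 2-D embeddings for two hypothetical known tools.
train = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = ["tts_a", "tts_a", "tts_b", "tts_b"]
centers = class_centers(train, labels)
```

With these toy centers, an embedding pointing along a known class direction is attributed to that class, while one pointing between or away from both classes falls below the threshold and is rejected. The more the known classes differ in direction, the easier this rejection becomes, which is the intuition behind emphasizing directional variation.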

Paper Structure (20 sections, 7 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Dataset Design
Design Policy
Deepfake Audio Collection
Real Audio Collection
Data Compression
Dataset Composition
Open-World Audio Deepfake Attribution: Attribution of Unknown Audio Generation Tools
Class-Representation Multi-Center Learning (CRML)
Inference
Experiments
Evaluation Metrics
Experimental Setup
Which features and models can better distinguish the attribution of system fingerprints?
...and 5 more sections

Figures (5)

Figure 1: Beyond the Limitation of Binary Deepfake Detection. (a) Previous audio deepfake detection focuses on distinguishing between real and fake audio. (b) Audio deepfake attribution aims to attribute audio generated by different deepfake technologies to their sources. Fake 1 through Fake N refer to different audio generation algorithms or tools.
Figure 2: Partitioning and construction of the SFR dataset. The top section shows the partitions of the clean set, as well as the number of speakers and the number of utterances contained in each set. The middle section simulates the data compression scenario of the data propagation process in the real world, and the bottom section shows the partitioning of the compressed set. Both the clean set and the compressed set contain roughly equal numbers of utterances.
Figure 3: A schematic diagram of OSADA based on class-representation multi-center learning (CRML). Firstly, CRML fosters intra-class compactness and inter-class separability within the representation space. To ensure the separation between known and unknown classes in real-world space, CRML enhances the global directional variation of representations. The robust class discriminative capability learned from known classes can effectively distinguish the relationship between known and unknown classes.
Figure 4: A t-SNE visualization of system fingerprint features of deepfake audio in the real world. The result shows that differences among TTS systems yield distinct fingerprint features that enable effective attribution.
Figure 5: A spectrogram comparison of deepfake audio from five speech synthesis systems and real audio. The content of each audio is ‘太阳光中蓝紫光波长较短。’ (‘The blue-violet light in sunlight has a shorter wavelength.’).
