Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

TL;DR

This paper reframes deepfake voice detection as a speaker verification task: given a reference set of genuine samples for the claimed identity, a training-free detector relies on embeddings from large pre-trained audio models. The method computes the maximum cosine similarity between the test embedding and the reference embeddings, and classifies the test sample as real if that similarity exceeds a threshold. Experiments on ASVspoof and InTheWild show strong generalization, with BEATs delivering the best overall performance and robustness, particularly in out-of-distribution scenarios. The findings highlight the potential of semantically rich pre-trained representations for robust deepfake detection without training on fake data, suggesting practical deployment with limited reference data and no dependence on the generation method.
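The decision rule summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`cosine_similarity`, `max_similarity`, `is_real`), the toy 3-dimensional vectors, and the threshold value of 0.5 are all assumptions made for the example. In practice the vectors would be embeddings produced by a pre-trained audio model such as BEATs.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def max_similarity(test: Vector, references: Sequence[Vector]) -> float:
    """Decision statistic: highest similarity to any reference embedding."""
    if not references:
        raise ValueError("reference set must not be empty")
    return max(cosine_similarity(test, ref) for ref in references)


def is_real(test: Vector, references: Sequence[Vector], threshold: float) -> bool:
    """Label the test sample as real if the statistic exceeds the threshold."""
    return max_similarity(test, references) > threshold


# Toy vectors standing in for model embeddings (illustrative values only).
references = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
close_sample = [0.9, 0.1, 0.0]    # points in nearly the same direction as a reference
distant_sample = [0.0, 0.0, 1.0]  # orthogonal to every reference

print(is_real(close_sample, references, threshold=0.5))
print(is_real(distant_sample, references, threshold=0.5))
```

Taking the maximum over the reference set means a single well-matched genuine recording is enough to accept the test sample, which is one plausible reason the approach can work with a limited reference set.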

Abstract

Generalization is a major issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which increasingly accurate synthesis methods are developed, it is very important to design techniques that also work well on data they were not trained on. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework, and fake audio is exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time, only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widely used in the community show that detectors based on pre-trained models achieve excellent performance and strong generalization ability, rivaling supervised methods on in-distribution data and largely outperforming them on out-of-distribution data.

Paper Structure (7 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Making decisions based on large pre-trained models. Audio signals are fed to the model to extract the corresponding embeddings. The decision statistic is computed in the latent space, as the maximum similarity score between the audio under test and the audio samples of the reference set.
  • Figure 2: t-SNE representation of real and fake feature embeddings of four identities taken from the InTheWild dataset.
  • Figure 3: AUC for the InTheWild dataset as a function of the number of audio samples in the reference set.
  • Figure 4: Histograms of maximum similarity scores for BEATs on the InTheWild dataset.
  • Figure 5: Accuracy of BEATs on the InTheWild dataset as a function of threshold.
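
Figures 4 and 5 relate the distribution of maximum similarity scores to accuracy at a given threshold. The sketch below shows how such an accuracy-versus-threshold curve could be computed from two lists of scores. The helper name `accuracy_at_threshold` is hypothetical, and the score values are invented for illustration; they are not results from the paper.

```python
from typing import Sequence


def accuracy_at_threshold(
    real_scores: Sequence[float],
    fake_scores: Sequence[float],
    threshold: float,
) -> float:
    """Fraction of samples labeled correctly when 'score > threshold' means real."""
    total = len(real_scores) + len(fake_scores)
    if total == 0:
        raise ValueError("at least one score is required")
    correct_real = sum(1 for s in real_scores if s > threshold)
    correct_fake = sum(1 for s in fake_scores if s <= threshold)
    return (correct_real + correct_fake) / total


# Invented maximum-similarity scores (illustrative only, not from the paper).
real_scores = [0.92, 0.88, 0.95, 0.79]
fake_scores = [0.61, 0.83, 0.55, 0.70]

# Sweep a few thresholds to trace out the accuracy curve.
for threshold in (0.5, 0.75, 0.9):
    acc = accuracy_at_threshold(real_scores, fake_scores, threshold)
    print(f"threshold={threshold:.2f} accuracy={acc:.3f}")
```

A threshold that is too low accepts most fakes, while one that is too high rejects genuine samples, so accuracy typically peaks somewhere between the two score distributions.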