Source Verification for Speech Deepfakes

Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro

TL;DR

This work defines source verification for speech deepfakes, reframing attribution as a verification task: embeddings from a source-attribution classifier are used to compare a test track against a reference set via cosine similarity, which allows open-set generalization to unseen generators. The method generalizes well across multiple datasets, with ResNet18 often delivering the best performance, but it is sensitive to speaker diversity, language mismatch, and post-processing such as speech enhancement. The approach offers a scalable, forensic-friendly foundation for tracing the origin of synthetic speech and points to directions for improving embedding robustness and cross-language applicability.

Abstract

With the proliferation of speech deepfake generators, it becomes crucial not only to assess the authenticity of synthetic audio but also to trace its origin. While source attribution models attempt to address this challenge, they often struggle in open-set conditions against unseen generators. In this paper, we introduce the source verification task, which, inspired by speaker verification, determines whether a test track was produced using the same model as a set of reference signals. Our approach leverages embeddings from a classifier trained for source attribution, computing distance scores between tracks to assess whether they originate from the same source. We evaluate multiple models across diverse scenarios, analyzing the impact of speaker diversity, language mismatch, and post-processing operations. This work provides the first exploration of source verification, highlighting its potential and vulnerabilities, and offers insights for real-world forensic applications.

Paper Structure

This paper contains 15 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Pipeline of the proposed source verification method for speech deepfakes. A Feature Extractor $\mathcal{C}$ is leveraged as an embedding extractor for both the test track and the tracks in the reference set. Then, the cosine similarity is computed between the embedding of the test track and each reference embedding. The decision is made by taking the maximum similarity.
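The scoring step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the embeddings have already been produced by the feature extractor $\mathcal{C}$, and the function names (`cosine_scores`, `verify_source`) and the `threshold` value are hypothetical choices made here for the example.

```python
import numpy as np


def cosine_scores(test_emb, ref_embs, eps=1e-12):
    """Cosine similarity between one test embedding and each reference embedding.

    test_emb: shape (D,); ref_embs: shape (N, D). Returns shape (N,).
    eps guards against division by zero for all-zero vectors (an assumption
    of this sketch, not something stated in the source).
    """
    test_emb = np.asarray(test_emb, dtype=float)
    ref_embs = np.asarray(ref_embs, dtype=float)
    test_unit = test_emb / (np.linalg.norm(test_emb) + eps)
    ref_units = ref_embs / (np.linalg.norm(ref_embs, axis=1, keepdims=True) + eps)
    return ref_units @ test_unit


def verify_source(test_emb, ref_embs, threshold=0.5):
    """Decide whether the test track shares its source with the reference set.

    The score is the maximum similarity over the reference set, as in the
    Figure 1 caption. The threshold is a hypothetical placeholder; in practice
    it would be tuned on held-out data.
    """
    score = float(cosine_scores(test_emb, ref_embs).max())
    return score, score >= threshold


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" standing in for the extractor's output.
    refs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    print(verify_source(np.array([2.0, 0.0, 0.0]), refs))
    print(verify_source(np.array([0.0, 0.0, 1.0]), refs))
```

Taking the maximum means a single close reference track is enough to accept the test track, which is one way to cope with a heterogeneous reference set; other aggregation rules, such as the mean, would behave differently.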