Source Verification for Speech Deepfakes
Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro
TL;DR
This work introduces source verification for speech deepfakes, which reframes source attribution as a verification task: embeddings from a classifier trained for source attribution are used to compare a test track against a set of reference tracks via cosine similarity, which allows the method to handle generators unseen during training (the open-set setting). The approach generalizes well across multiple datasets, with ResNet18 often delivering the best performance, but it is sensitive to speaker diversity, language mismatch, and post-processing such as speech enhancement. The result is a scalable, forensics-oriented framework for tracing the origin of synthetic speech, with potential applications in content authenticity checks and security analysis. The findings also suggest directions for improving embedding robustness and cross-language applicability.
Abstract
With the proliferation of speech deepfake generators, it becomes crucial not only to assess the authenticity of synthetic audio but also to trace its origin. While source attribution models attempt to address this challenge, they often struggle in open-set conditions against unseen generators. In this paper, we introduce the source verification task, which, inspired by speaker verification, determines whether a test track was produced using the same model as a set of reference signals. Our approach leverages embeddings from a classifier trained for source attribution, computing distance scores between tracks to assess whether they originate from the same source. We evaluate multiple models across diverse scenarios, analyzing the impact of speaker diversity, language mismatch, and post-processing operations. This work provides the first exploration of source verification, highlighting its potential and vulnerabilities, and offers insights for real-world forensic applications.
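The scoring step described above can be illustrated with a minimal sketch. It assumes that embeddings have already been extracted by a source-attribution classifier and that the test track is scored by averaging its cosine similarity to each reference embedding. The averaging rule, the function names, the toy 4-dimensional vectors, and the threshold of 0.5 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify_source(test_emb, reference_embs, threshold=0.5):
    """Score a test embedding against a set of reference embeddings.

    Assumption: the final score is the mean cosine similarity to all
    references, and the threshold is a hypothetical value that would
    normally be tuned on held-out data.
    """
    scores = [cosine_similarity(test_emb, ref) for ref in reference_embs]
    score = float(np.mean(scores))
    return score, score >= threshold


# Toy embeddings standing in for classifier outputs (hypothetical values).
references = [
    [1.0, 0.1, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
]
same_source_track = [1.0, 0.05, 0.05, 0.0]   # close to the references
other_source_track = [0.1, 1.0, 0.0, 0.0]    # points in a different direction

same_score, same_ok = verify_source(same_source_track, references)
other_score, other_ok = verify_source(other_source_track, references)
print(f"same source:  score={same_score:.3f}, accepted={same_ok}")
print(f"other source: score={other_score:.3f}, accepted={other_ok}")
```

Because the decision depends only on distances in the embedding space, a generator unseen during training can in principle be verified as long as reference tracks from it are available, which is what makes the open-set setting tractable.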
