Forensic Similarity for Speech Deepfakes

Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro

TL;DR

This paper introduces forensic similarity for speech deepfakes, recasting source verification as a comparison of generator-specific forensic traces between audio pairs. It presents a two-part Siamese framework: a feature extractor backbone (LCNN/RawNet2/ResNet18/AASIST) trained for source tracing, and a lightweight similarity network that outputs a score $S\in[0,1]$ to indicate shared forensic traces. The approach generalizes well to unseen generators and supports splicing detection, outperforming baseline similarity measures and showing robust open-set performance across MLAAD, ASVspoof 2019, and TIMIT-TTS datasets. It demonstrates practical applicability for digital audio forensics with reasonable resilience to short-duration inputs and promising splicing-point localization, while outlining future work to further mitigate linguistic cues and improve splice localization accuracy.

Abstract

In this paper, we introduce a digital audio forensics approach called Forensic Similarity for Speech Deepfakes, which determines whether two audio segments contain the same forensic traces. Our work is inspired by prior work on forensic similarity in the image domain, which demonstrated strong generalization to unknown forensic traces without requiring prior knowledge of them at training time. To achieve this in the audio setting, we propose a two-part deep-learning system composed of a feature extractor based on a speech deepfake detector backbone and a shallow neural network, referred to as the similarity network. This system maps pairs of audio segments to a score indicating whether they contain the same or different forensic traces. We evaluate the system on the emerging task of source verification, highlighting its ability to identify whether two samples originate from the same generative model. Additionally, we assess its applicability to splicing detection as a complementary use case. Experiments show that the method generalizes to a wide range of forensic traces, including previously unseen ones, illustrating its flexibility and practical value in digital audio forensics.
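
The two-part structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: `extract_features` is a hypothetical stand-in for the trained backbone (it computes a few hand-picked signal statistics instead of learned embeddings), and `SimilarityNetwork` is an untrained one-hidden-layer network with random weights, so the scores it produces carry no forensic meaning. The layer sizes, the concatenation of the two embeddings, and all names are assumptions made for illustration. The sketch only shows the data flow: both segments pass through the same extractor, and a shallow network maps the embedding pair to a score in [0, 1].

```python
import math
import random


def extract_features(segment):
    """Hypothetical stand-in for the backbone: waveform -> fixed-length vector."""
    n = len(segment)
    mean = sum(segment) / n
    energy = sum(x * x for x in segment) / n
    # Fraction of adjacent sample pairs whose sign differs.
    zero_cross = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0) / n
    peak = max(abs(x) for x in segment)
    return [mean, energy, zero_cross, peak]


class SimilarityNetwork:
    """Assumed shallow network: one ReLU hidden layer, sigmoid output."""

    def __init__(self, feat_dim, hidden=8, seed=0):
        rng = random.Random(seed)
        # Random, untrained weights (illustration only).
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(2 * feat_dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def score(self, feat_a, feat_b):
        x = feat_a + feat_b  # concatenate the two embeddings
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * v for w, v in zip(self.w2, h)) + self.b2
        # Numerically stable sigmoid, so the score always lies in [0, 1].
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)


def forensic_similarity(seg_a, seg_b, net):
    """Siamese flow: the same extractor is applied to both segments."""
    return net.score(extract_features(seg_a), extract_features(seg_b))


if __name__ == "__main__":
    rng = random.Random(1)
    seg_a = [rng.uniform(-1, 1) for _ in range(8000)]
    seg_b = [rng.uniform(-1, 1) for _ in range(8000)]
    net = SimilarityNetwork(feat_dim=4)
    print(forensic_similarity(seg_a, seg_b, net))
```

In a real system the extractor would be one of the trained backbones and the similarity network would be trained on labelled same-source and different-source pairs; only then would a high score suggest shared forensic traces.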

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed framework.
  • Figure 2: Example of similarity score sequence computed by the proposed framework for the splicing detection use case over a partially spoofed track (blue segments indicate genuine speech, red synthetic).
  • Figure 3: Detection rates of the proposed framework using LCNN as the feature extractor, with fine-tuning applied during the second learning phase, evaluated for each generator pair on the MLAAD test set. Lighter cells indicate higher detection rates.
  • Figure 4: ROC curves for splicing detection on PartialSpoof using input pairs of 0.5 seconds with a stride of 0.05 seconds.