Analyzing the Impact of Splicing Artifacts in Partially Fake Speech Signals

Viola Negroni, Davide Salvi, Paolo Bestagini, Stefano Tubaro

TL;DR

This work examines how concatenation-induced artifacts, specifically spectral leakage at splice points, affect partially fake speech datasets (PartialSpoof and HAD) and the detectors trained on them. By analyzing both simple sinusoidal concatenations and real-world spliced tracks, the study shows that splicing artifacts are detectable via a lightweight dynamic-range analysis on STFT features, achieving an average EER of $6.16\%$ on PartialSpoof and $7.36\%$ on HAD without training any detector. The authors demonstrate that these artifacts can bias frequency-domain detectors, and they explore mitigation strategies (windowing, smoothing, source-data selection, high-pass filtering) and retraining approaches to reduce such biases. The results highlight the need for careful dataset design and model training, so that detectors generalize beyond artifact-driven cues in realistic spliced-speech scenarios.

Abstract

Speech deepfake detection has recently gained significant attention within the multimedia forensics community. Related issues have also been explored, such as the identification of partially fake signals, i.e., tracks that include both real and fake speech segments. However, generating high-quality spliced audio is not as straightforward as it may appear. Spliced signals are typically created through basic signal concatenation. This process could introduce noticeable artifacts that can make the generated data easier to detect. We analyze spliced audio tracks resulting from signal concatenation, investigate their artifacts and assess whether such artifacts introduce any bias in existing datasets. Our findings reveal that by analyzing splicing artifacts, we can achieve a detection EER of 6.16% and 7.36% on PartialSpoof and HAD datasets, respectively, without needing to train any detector. These results underscore the complexities of generating reliable spliced audio data and lead to discussions that can help improve future research in this area.
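The claim that plain concatenation leaves a detectable trace can be illustrated with a toy example. The sketch below is not the authors' detector: the test signal, the 80-sample rectangular frames, the high-band cutoff (`lo_bin`), the energy floor, and the max-minus-median score are all assumptions chosen for illustration. It joins an 800 Hz tone to a phase-shifted 1600 Hz tone and measures, frame by frame, how much energy lands in the upper half of the spectrum, where neither tone has any content.

```python
import cmath
import math
import statistics

FS = 16_000   # sampling rate in Hz
FRAME = 80    # frame length: a whole number of periods for both 800 Hz and 1600 Hz

def spliced_tone(n_total, splice_at=None):
    """800 Hz tone; if splice_at is given, switch to a phase-shifted 1600 Hz tone there."""
    x = []
    for n in range(n_total):
        if splice_at is None or n < splice_at:
            x.append(math.sin(2 * math.pi * 800 * n / FS))
        else:
            x.append(math.sin(2 * math.pi * 1600 * (n - splice_at) / FS - math.pi / 2))
    return x

def high_band_db(x, lo_bin=20, floor=1e-12):
    """Per-frame energy (dB) in DFT bins lo_bin..FRAME/2; rectangular window, no overlap."""
    levels = []
    for start in range(0, len(x) - FRAME + 1, FRAME):
        frame = x[start:start + FRAME]
        energy = 0.0
        for k in range(lo_bin, FRAME // 2 + 1):
            coeff = sum(v * cmath.exp(-2j * math.pi * k * t / FRAME)
                        for t, v in enumerate(frame))
            energy += abs(coeff) ** 2
        # Clip at a small floor so numerical noise does not dominate the dB scale.
        levels.append(10 * math.log10(max(energy, floor)))
    return levels

def dynamic_range_score(levels_db):
    """Hypothetical score: loudest frame minus the median frame, in dB."""
    return max(levels_db) - statistics.median(levels_db)

clean = high_band_db(spliced_tone(1600))
spliced = high_band_db(spliced_tone(1600, splice_at=825))

clean_score = dynamic_range_score(clean)
spliced_score = dynamic_range_score(spliced)
splice_frame = spliced.index(max(spliced))  # frame holding the splice point

print(f"clean score: {clean_score:.1f} dB, spliced score: {spliced_score:.1f} dB")
print(f"loudest high-band frame: {splice_frame} (splice at sample 825)")
```

In this toy setup only the frame that contains the splice shows broadband energy, so a threshold on the score separates the two tracks without any training. Real speech is far less clean than two sinusoids, which is why the figures reported for PartialSpoof and HAD should be read from the paper rather than inferred from this sketch.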

Paper Structure (11 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Frequency domain analysis of a sinusoid of frequency $f_0$ with two different analysis windows. $f_\text{s} = 16$ kHz, $f_0 = 800$ Hz, period $T = 20$ samples. FFT up to the Nyquist frequency and spectrogram (dB). Left: signal portion in the time domain. Center: DFT. Right: STFT magnitude. Top: STFT window of $L = 80$ samples (a multiple of $T$). Bottom: STFT window of $L = 88$ samples (not a multiple of $T$), revealing spectral artifacts.
  • Figure 2: Frequency domain analysis of two concatenated sinusoids in different setups. Left: signal in the time domain. Right: STFT magnitude.
  • Figure 3: Frequency domain analysis of one example track per dataset: CON_D_0000001.wav from PartialSpoof (left) and ADD2023_T2_D_00000036.wav from HAD (right). STFT window size: 2048 samples, Hop size: 256 samples, no zero-padding. Top: full log-scaled spectrograms. Bottom: ground truth splicing timestamps, color changes indicate the splicing points.
  • Figure 4: ROC and AUC values of the detectors trained and tested on the original PartialSpoof dataset.
  • Figure 5: ROC and AUC values of the detectors trained on the original PartialSpoof dataset and tested on its high-pass filtered version.
  • ...and 1 more figures
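
The window-length effect described in the Figure 1 caption can be reproduced numerically. The following is a minimal sketch using the caption's parameters ($f_\text{s} = 16$ kHz, $f_0 = 800$ Hz, $L = 80$ and $L = 88$); the rectangular window and the `leakage_ratio` measure (share of one-sided spectral energy outside the strongest bin) are assumptions made here for illustration, not quantities taken from the paper.

```python
import cmath
import math

FS = 16_000   # sampling rate in Hz (from the Figure 1 caption)
F0 = 800      # sinusoid frequency in Hz, so one period is FS / F0 = 20 samples

def one_sided_energy(length):
    """Energy per DFT bin (0..length//2) of `length` samples of the tone."""
    x = [math.sin(2 * math.pi * F0 * n / FS) for n in range(length)]
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * n / length)
                for n, v in enumerate(x))) ** 2
        for k in range(length // 2 + 1)
    ]

def leakage_ratio(length):
    """Fraction of one-sided spectral energy that falls outside the strongest bin."""
    energy = one_sided_energy(length)
    return 1.0 - max(energy) / sum(energy)

leak_80 = leakage_ratio(80)   # window spans exactly 4 periods
leak_88 = leakage_ratio(88)   # window spans 4.4 periods

print(f"L = 80: leakage ratio {leak_80:.2e}")
print(f"L = 88: leakage ratio {leak_88:.2e}")
```

With $L = 80$ the window holds a whole number of periods and essentially all energy sits in a single bin; with $L = 88$ the truncated period spreads energy across neighbouring bins, which is the spectral artifact the bottom row of Figure 1 refers to.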