Split and Conquer Partial Deepfake Speech

Inbal Rimon, Oren Gal, Haim Permuter

Abstract

Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
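As a concrete illustration of the reflection-based multi-length strategy described above, the following Python sketch extends a variable-duration segment to several fixed input lengths by mirroring it about its endpoints. This is a minimal sketch under stated assumptions: the target lengths, the crop-when-longer behaviour, and the function names are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def reflect_to_length(segment: np.ndarray, target_len: int) -> np.ndarray:
    """Return a fixed-length version of a 1-D waveform segment.

    Short segments are extended by appending alternating time-reversed
    copies (reflection); long segments are simply cropped to target_len.
    """
    if len(segment) >= target_len:
        return segment[:target_len]
    pieces, flipped = [segment], True
    while sum(len(p) for p in pieces) < target_len:
        pieces.append(segment[::-1] if flipped else segment)
        flipped = not flipped
    return np.concatenate(pieces)[:target_len]

def multi_length_views(segment, target_lens=(16000, 32000, 64000)):
    """Produce several fixed-length inputs (here 1 s, 2 s, 4 s at 16 kHz,
    placeholder values) from a single variable-duration segment."""
    seg = np.asarray(segment, dtype=np.float32)
    return [reflect_to_length(seg, n) for n in target_lens]
```

Each fixed-length view can then be fed to the corresponding classifier, so that one segment contributes several diverse feature-space representations during training.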

Figures (4)

  • Figure 1: Overview of the proposed partial deepfake speech detection pipeline. (1) Input Audio: a full utterance containing both bona fide and manipulated regions. (2) Boundary Detection: a frame-level model predicts transition points between acoustic regions. (3) Audio Splitting: the signal is partitioned into segments according to the detected boundaries, producing candidate spoof-uniform regions. (4) Segment-Level Classification: each segment is independently evaluated by a classifier that assigns an authenticity score. (5) Frame-Level Detection: segment predictions are projected back onto the temporal axis to obtain fine-grained localization of manipulated speech at frame resolution. (A minimal code sketch of steps (3)-(5) follows this figure list.)
  • Figure 2: Detection error trade-off (DET) curves of all single models and their score-level fusion on the evaluation set. Each curve reports the miss rate as a function of the false alarm rate, with both axes shown up to 40%. Individual configurations in the legend follow the naming pattern feature extractor_augmentation_fixed input length, indicating the feature representation, training augmentation, and segment input duration used by each model. While individual models exhibit complementary strengths in different operating regions, the fused system consistently dominates the single-model baselines, achieving lower miss rates across most false alarm rates and yielding the best overall trade-off between bona fide rejection and spoof acceptance. (A minimal fusion sketch follows this figure list.)
  • Figure 3: Distribution of per-utterance EER obtained using the complete pipeline on the PartialSpoof evaluation set (71,239 utterances). Each bin represents the EER computed independently for a single utterance. The dashed vertical line indicates the average EER of 5.47%. Notably, 54.5% of utterances achieve zero EER, highlighting the skewed performance distribution across samples. (A minimal EER sketch follows this figure list.)
  • Figure 4: Log-magnitude spectrogram examples from three corpora. Top row: PartialSpoof, English. Middle row: HAD, Mandarin. Bottom row: LPS, English.
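The split-and-conquer flow summarized in Figure 1, from detected boundaries to frame-level scores, can be sketched roughly as below. This is a hedged illustration: the boundary times, frame hop, sampling rate, and the classify_segment callable are placeholders, not the paper's actual interfaces.

```python
import numpy as np

def split_by_boundaries(wave, boundaries_sec, sr=16000):
    """Cut the waveform at detected boundary times (seconds);
    returns a list of (start_sec, end_sec, samples) segments."""
    edges = [0.0] + sorted(boundaries_sec) + [len(wave) / sr]
    segments = []
    for start, end in zip(edges[:-1], edges[1:]):
        a, b = int(start * sr), int(end * sr)
        if b > a:
            segments.append((start, end, wave[a:b]))
    return segments

def frame_level_scores(wave, boundaries_sec, classify_segment,
                       sr=16000, hop_sec=0.02):
    """Assign every frame the authenticity score of the segment covering it."""
    n_frames = int(np.ceil(len(wave) / sr / hop_sec))
    scores = np.zeros(n_frames)
    for start, end, seg in split_by_boundaries(wave, boundaries_sec, sr):
        s = classify_segment(seg)  # scalar score, e.g. P(bona fide)
        f0, f1 = int(start / hop_sec), int(np.ceil(end / hop_sec))
        scores[f0:min(f1, n_frames)] = s
    return scores
```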
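The score-level fusion shown in Figure 2 can in principle be as simple as a weighted average of per-model scores; the sketch below assumes equal weights and pre-calibrated scores, which may differ from the fusion actually used in the paper.

```python
import numpy as np

def fuse_scores(score_matrix, weights=None):
    """Weighted average of per-model scores.

    score_matrix: array of shape (n_models, n_trials).
    weights:      optional per-model weights; equal weighting by default.
    """
    scores = np.asarray(score_matrix, dtype=float)
    w = np.ones(scores.shape[0]) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ scores
```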
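For completeness, a minimal sketch of the equal error rate underlying Figure 3, assuming higher scores indicate bona fide speech; per-utterance EER is this quantity computed over the frames of a single utterance.

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Equal error rate: operating point where the miss rate (bona fide
    rejected) equals the false alarm rate (spoof accepted)."""
    bona = np.asarray(bona_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    miss = np.array([(bona < t).mean() for t in thresholds])
    fa = np.array([(spoof >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))
    return float((miss[idx] + fa[idx]) / 2)
```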