SynthForensics: A Multi-Generator Benchmark for Detecting Synthetic Video Deepfakes

Roberto Leotta, Salvatore Alfio Sambataro, Claudio Vittorio Ragaglia, Mirko Casu, Yuri Petralia, Francesco Guarnera, Luca Guarnera, Sebastiano Battiato

TL;DR

SynthForensics introduces what the authors present as the first human-centric benchmark for purely synthetic video deepfakes. It uses a paired-source protocol across five open-source T2V models to produce 6,815 high-quality samples, each provided in four compression variants. The study shows that current detectors degrade sharply on synthetic content, both in zero-shot settings and under compression. Fine-tuning and generator-based training yield strong forward generalization within synthetic domains but poor backward transfer to legacy manipulation-based benchmarks. The work releases extensive metadata, prompts, and validation pipelines to support reproducibility and further research into robust, generalizable detection methods. It highlights the need for detectors tailored to generation artifacts, and for contextualized evaluation, to safeguard multimedia authenticity in an era of accessible high-fidelity synthesis.

Abstract

The landscape of synthetic media has been irrevocably altered by text-to-video (T2V) models, whose outputs are rapidly approaching indistinguishability from reality. Critically, this technology is no longer confined to large-scale labs; the proliferation of efficient, open-source generators is democratizing the ability to create high-fidelity synthetic content on consumer-grade hardware. This makes existing face-centric and manipulation-based benchmarks obsolete. To address this urgent threat, we introduce SynthForensics, to the best of our knowledge the first human-centric benchmark for detecting purely synthetic video deepfakes. The benchmark comprises 6,815 unique videos from five architecturally distinct, state-of-the-art open-source T2V models. Its construction was underpinned by a meticulous two-stage, human-in-the-loop validation to ensure high semantic and visual quality. Each video is provided in four versions (raw, lossless, light compression, and heavy compression) to enable real-world robustness testing. Experiments demonstrate that state-of-the-art detectors are both fragile and limited in generalization when evaluated on this new domain: we observe a mean performance drop of $29.19\%$ AUC, with some methods performing worse than random chance, and top models losing over 30 points under heavy compression. The paper further investigates training on SynthForensics as a means to mitigate these performance gaps, achieving robust generalization to unseen generators ($93.81\%$ AUC), though at the cost of reduced backward compatibility with traditional manipulation-based deepfakes. The complete dataset and all generation metadata, including the specific prompts and inference parameters for every video, will be made publicly available at [link anonymized for review].
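The abstract reports results as AUC, with some detectors falling below random chance (50%). As a rough illustration of what such a number measures, the sketch below computes a video-level AUC from per-frame detector scores. It assumes that frame scores are averaged per video and that higher scores mean "more likely fake"; neither choice is confirmed by the text here, and the function names `video_scores` and `roc_auc` are hypothetical.

```python
from statistics import mean


def video_scores(frame_scores):
    """Collapse per-frame scores into one score per video.

    Mean pooling is an assumption made for this sketch; the paper's
    own aggregation rule is not stated in this section.
    """
    return {video_id: mean(scores) for video_id, scores in frame_scores.items()}


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen fake (label 1) is
    scored above a randomly chosen real video (label 0); ties count 0.5.
    """
    fakes = [s for label, s in zip(labels, scores) if label == 1]
    reals = [s for label, s in zip(labels, scores) if label == 0]
    if not fakes or not reals:
        raise ValueError("AUC needs at least one fake and one real video")
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f in fakes
        for r in reals
    )
    return wins / (len(fakes) * len(reals))


# Toy data, invented purely for illustration.
toy_frames = {"fake_a": [0.8, 1.0], "fake_b": [0.3, 0.5],
              "real_a": [0.6, 0.6], "real_b": [0.1, 0.1]}
toy_labels = {"fake_a": 1, "fake_b": 1, "real_a": 0, "real_b": 0}

pooled = video_scores(toy_frames)
ids = sorted(pooled)
toy_auc = roc_auc([toy_labels[i] for i in ids], [pooled[i] for i in ids])
```

An AUC of 0.5 corresponds to random ranking, so a value below 0.5 means the detector systematically ranks real videos as more suspicious than synthetic ones.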

Paper Structure

This paper contains 69 sections, 5 equations, 10 figures, 60 tables.

Figures (10)

  • Figure 1: The paired-source data generation pipeline for SynthForensics. Real source videos (left) from FF++ and DFD are analyzed to extract structured metadata, which guides five text-to-video (T2V) models (Wan2.1, SkyReels-V2, Self-Forcing, CogVideoX, MAGI-1) to generate synthetic counterparts (right).
  • Figure 2: The SynthForensics benchmark. Starting from 1,363 source videos (FF++, DFD), a VLM generates structured 7-field descriptions validated by human reviewers. Each prompt is optimized for five T2V models (CogVideoX, MAGI-1, Self-Forcing, SkyReels-V2, Wan2.1). Synthesized videos undergo human validation before processing into four compression versions (Raw, Canonical, CRF23, CRF40). The final benchmark comprises 6,815 unique videos (27,260 total) with full metadata (prompts, parameters) for reproducibility.
  • Figure 3: Zero-shot performance (Video AUC %) of state-of-the-art detectors on the SF-FF++ test set across three versions: Canonical, CRF23, and CRF40.
  • Figure 4: The exact System Prompt utilized with VideoLLaMA 3 [zhang2025videollama] to extract the structured positive prompts that guide the synthetic video generation. The schema enforces a 7-field decomposition to ensure separation between aesthetic, semantic, and technical video attributes.
  • Figure 5: Representative example of the raw structured metadata extracted from a source video.
  • ...and 5 more figures
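
The dataset sizes quoted in the Figure 2 caption follow from simple multiplication, and the CRF23 and CRF40 variants are named after an encoder quality setting. The sketch below checks the arithmetic and builds (without running) an `ffmpeg` command for a CRF re-encode. The choice of the `libx264` encoder and the helper name `ffmpeg_crf_command` are assumptions for illustration; the caption does not state which codec or command line the authors used.

```python
SOURCE_VIDEOS = 1363  # real source videos from FF++ and DFD (Figure 2)
GENERATORS = ["CogVideoX", "MAGI-1", "Self-Forcing", "SkyReels-V2", "Wan2.1"]
VERSIONS = ["Raw", "Canonical", "CRF23", "CRF40"]

# One synthetic counterpart per source video and generator.
unique_videos = SOURCE_VIDEOS * len(GENERATORS)
# Every unique video is distributed in all four versions.
total_files = unique_videos * len(VERSIONS)


def ffmpeg_crf_command(src, dst, crf):
    """Return an ffmpeg argument list that re-encodes src at a given CRF.

    Assumes H.264 via libx264 (hypothetical choice); for 8-bit x264,
    valid CRF values run from 0 to 51, and higher means stronger
    compression.
    """
    if not 0 <= crf <= 51:
        raise ValueError("CRF must be between 0 and 51 for 8-bit libx264")
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]


light_cmd = ffmpeg_crf_command("clip_raw.mp4", "clip_crf23.mp4", 23)
heavy_cmd = ffmpeg_crf_command("clip_raw.mp4", "clip_crf40.mp4", 40)
```

Here `unique_videos` evaluates to 6,815 and `total_files` to 27,260, matching the counts in the caption. The argument lists could be passed to `subprocess.run` on a machine with `ffmpeg` installed.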