FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection

Minji Heo, Simon S. Woo

TL;DR

FakeChain introduces the first large-scale benchmark of 1-, 2-, and 3-step deepfakes generated by heterogeneous pipelines, enabling systematic study of how manipulation depth and the final-step generator affect detection. The study finds that detectors trained on single-step forgeries generalize poorly to multi-step forgeries and that their performance is tied closely to the last manipulation applied, with F1 dropping by up to 58.83% in the worst case. Using t-SNE, FFT, and mutual information analyses, the authors show that final-step artifacts dominate both learned representations and frequency-domain traces, while traces of earlier steps are progressively overwritten. The results point to the need for history-aware detectors trained on diverse manipulation chains, and the spectral differences observed across generators can inform future forensic benchmark design and detector development.

Abstract

Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as face swapping, GAN-based generation, and diffusion methods, pose an emerging and unforeseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated, single manipulations, little is known about how detection models behave under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce FakeChain, a large-scale benchmark comprising 1-, 2-, and 3-Step forgeries synthesized using five representative state-of-the-art generators. Using this benchmark, we analyze detection performance and spectral properties across hybrid manipulations at different steps, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance depends heavily on the final manipulation type, with the F1-score dropping by up to 58.83% when it differs from the training distribution. This demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, which limits generalization. Such findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results also underline the importance of benchmarks such as FakeChain, which reflect the growing complexity and diversity of synthesis in real-world scenarios. Our sample code is available at https://github.com/minjihh/FakeChain.
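To make the notion of 1-, 2-, and 3-Step chains concrete, the following is a minimal sketch of how manipulation chains over five generators could be enumerated and grouped by their final step. It is an illustration under stated assumptions, not the benchmark's construction: the generator names are placeholders (this page names only StyleSwin), and the rule that no generator repeats within a chain is an assumption that merely agrees with the four 2-Step variants per final generator described for Figure 3. The actual benchmark may include only a subset of these combinations.

```python
from itertools import permutations

# Placeholder names (hypothetical): the source states that five generators
# are used but does not list them all on this page.
GENERATORS = ["G1", "G2", "G3", "G4", "G5"]


def enumerate_chains(generators, max_depth=3):
    """Return {depth: [chain, ...]} for ordered chains of generators.

    Assumption: a generator is never applied twice within one chain.
    """
    return {
        depth: list(permutations(generators, depth))
        for depth in range(1, max_depth + 1)
    }


def group_by_final_step(chains):
    """Group chains by their last generator, i.e. the final manipulation."""
    groups = {}
    for chain in chains:
        groups.setdefault(chain[-1], []).append(chain)
    return groups


chains = enumerate_chains(GENERATORS)
two_step_by_final = group_by_final_step(chains[2])

# Number of candidate chains per depth under the no-repeat assumption.
print({depth: len(items) for depth, items in chains.items()})
# 2-Step chains that end in the same (placeholder) generator.
print(two_step_by_final["G5"])
```

Grouping by the final step mirrors the evaluation question raised in the abstract: whether a detector's score on a chain is explained mostly by the last generator rather than by the chain as a whole.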

Paper Structure

This paper contains 29 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison between Single-Step and Multi-Step Deepfake Generation & Manipulation (FakeChain) Pipelines.
  • Figure 2: t-SNE visualizations of penultimate-layer features for real and manipulated images across different manipulation types and training settings. Each column corresponds to a manipulation type: FaceSwap (blue), GAN (orange), and Diffusion (green). Top row (a–c) shows features extracted from models trained on 1-Step manipulated data, while the bottom row (d–f) shows results from models trained on 2-Step manipulated data. Black points represent real images. Each plot illustrates how well different manipulation depths (1-, 2-, 3-Step) are clustered or separated from real samples under various training regimes.
  • Figure 3: Averaged FFT spectra when each manipulation method is applied as the final step. In each subfigure, the leftmost image shows the frequency response of a 1-Step manipulation, while the four images to the right represent 2-Step manipulations that share the same final generator but differ in the preceding generation stage.
  • Figure 4: Using StyleSwin as the final step in multi-stage manipulations leads to identity collapse, where outputs exhibit similar facial features regardless of the input source.
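The averaged FFT spectra described for Figure 3 can be approximated with a short NumPy sketch. This is an assumption-laden illustration rather than the authors' procedure: it assumes grayscale images of equal size, and the choice of a centred log-magnitude spectrum is a common convention that the source does not confirm. The function name `averaged_log_spectrum` is hypothetical.

```python
import numpy as np


def averaged_log_spectrum(images):
    """Average the centred log-magnitude 2D FFT spectrum over a batch.

    images: array-like of shape (N, H, W), grayscale, all the same size
    (an assumption; the source does not describe its preprocessing).
    Returns an array of shape (H, W).
    """
    images = np.asarray(images, dtype=np.float64)
    if images.ndim != 3:
        raise ValueError("expected an array of shape (N, H, W)")
    # 2D FFT per image, then move the zero-frequency term to the centre.
    spectra = np.fft.fftshift(
        np.fft.fft2(images, axes=(-2, -1)), axes=(-2, -1)
    )
    # log1p compresses the dynamic range so weak high-frequency
    # components stay visible next to the DC term.
    return np.log1p(np.abs(spectra)).mean(axis=0)


# Usage with random stand-in data; real use would load images that share
# the same final-step generator.
rng = np.random.default_rng(0)
batch = rng.random((16, 64, 64))
mean_spectrum = averaged_log_spectrum(batch)
print(mean_spectrum.shape)
```

Computing one such average per chain and comparing chains that share a final generator is one way to check whether earlier steps leave a visible trace in the frequency domain.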