DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization

Siran Peng, Haoyuan Zhang, Li Gao, Tianshuo Zhang, Xiangyu Zhu, Bao Li, Weisong Zhao, Zhen Lei

TL;DR

DiffusionFF addresses the dual need for accurate face forgery detection and fine-grained artifact localization by introducing a diffusion-based decoder conditioned on multi-scale features from a pretrained forgery detector. The framework uses an encoder–decoder architecture where the detector acts as an artifact encoder and a denoising diffusion model serves as the artifact decoder to generate precise DSSIM maps, which are fused with high-level detector features to produce the final decision. Through a two-stage training strategy and extensive experiments on FF++ and cross-dataset benchmarks, DiffusionFF achieves state-of-the-art detection performance and superior artifact localization, while also offering improved explainability. Its plug-and-play nature as an auxiliary module further enhances existing detectors, though diffusion-based inference remains computationally intensive.

Abstract

The rapid evolution of deepfake technologies demands robust and reliable face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery clues is also important for enhancing model explainability and building user trust. To address this dual challenge, we introduce DiffusionFF, a diffusion-based framework that simultaneously performs face forgery detection and fine-grained artifact localization. Our key idea is to establish a novel encoder-decoder architecture: a pretrained forgery detector serves as a powerful "artifact encoder", and a denoising diffusion model is repurposed as an "artifact decoder". Conditioned on multi-scale forgery-related features extracted by the encoder, the decoder progressively synthesizes a detailed artifact localization map. We then fuse this fine-grained localization map with high-level semantic features from the forgery detector, leading to substantial improvements in detection capability. Extensive experiments show that DiffusionFF achieves state-of-the-art (SOTA) performance across multiple benchmarks, underscoring its superior effectiveness and explainability.
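The "progressive synthesis" step in the abstract can be illustrated with a toy reverse-diffusion loop. This is a minimal sketch that assumes a standard DDPM ancestral sampler with a linear noise schedule; the excerpt does not state which sampler, schedule, or network DiffusionFF actually uses. The names `make_schedule`, `toy_noise_predictor`, and `sample_map` are hypothetical, and the predictor is an oracle that reads the clean map directly from `cond`, standing in for a trained denoiser that would have to infer it from the encoder's multi-scale features.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption); returns betas, alphas, cumulative alpha products."""
    span = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * t / span for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, cond, alpha_bars):
    """Hypothetical stand-in for the conditional denoiser.

    It "cheats": cond already holds the clean map, so the noise in x_t can be
    solved for exactly. A trained network would predict it from encoder features.
    """
    ab = alpha_bars[t]
    return [(x - math.sqrt(ab) * c) / math.sqrt(1.0 - ab) for x, c in zip(x_t, cond)]


def sample_map(cond, num_steps=50, seed=0):
    """Run the DDPM reverse process from Gaussian noise down to a flattened map."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in cond]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t, cond, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            # Re-inject noise at every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x
```

Because the predictor here is an oracle, the final step recovers the conditioning map up to floating-point error. With a learned predictor, the quality of the output map would depend on how well the encoder features capture the artifacts.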

Paper Structure

This paper contains 50 sections, 5 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Visual comparison of DiffusionFF with mask-based and existing DSSIM-based artifact localization methods across four manipulation types (DF, F2F, FS, and NT) from the FaceForensics++ (FF++) dataset [Rossler_2019_ICCV]. The Ground-Truth (GT) DSSIM map is generated by comparing each fake image with its corresponding real image using the algorithm detailed in Section \ref{gt_ssim}.
  • Figure 2: Correlation between the quality of the estimated DSSIM maps and detection performance. We fuse the DSSIM maps estimated by LiSiam [wang2022lisiam], LRL [chen2021local], U-Net [ronneberger2015u], and our DiffusionFF with high-level semantic features from a shared forgery detector to obtain classification results. Integrating estimated DSSIM maps into the detection network consistently improves performance, with higher-quality maps leading to greater performance gains.
  • Figure 3: Overview of the DiffusionFF framework. Given an input facial image, DiffusionFF simultaneously predicts a forgery score and estimates a DSSIM map that precisely localizes fine-grained facial forgery clues. $N$ denotes the number of stages in the forgery detector.
  • Figure 4: Qualitative DSSIM map estimation results on CDF2. DiffusionFF demonstrates strong generalization ability by producing more precise and fine-grained artifact localization results.
  • Figure 5: Qualitative DSSIM map estimation results on the FF++ dataset. Our method achieves the most visually faithful outcomes.
  • ...and 5 more figures
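
The Figure 1 caption says the ground-truth DSSIM map comes from comparing each fake image with its real counterpart, but the exact algorithm lives in a section not reproduced here. As a rough illustration only, the sketch below assumes the commonly used definition DSSIM = (1 − SSIM) / 2, computed per pixel over a small uniform window on grayscale images with values in [0, 1]. The function names, the box window (standard SSIM typically uses a Gaussian window), and the window radius are assumptions, not the paper's settings.

```python
def _window(img, r, c, radius):
    """Collect pixel values in a square window around (r, c), clipped at the borders."""
    h, w = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - radius), min(h, r + radius + 1))
            for j in range(max(0, c - radius), min(w, c + radius + 1))]


def _ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized lists of pixel intensities."""
    n = len(x)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


def dssim_map(real, fake, radius=1):
    """Per-pixel structural dissimilarity: 0 where the images agree, larger where they differ."""
    return [[(1.0 - _ssim(_window(real, r, c, radius), _window(fake, r, c, radius))) / 2.0
             for c in range(len(real[0]))]
            for r in range(len(real))]
```

Under this definition the map is zero wherever the fake and real images are locally identical and grows toward 1 where local structure diverges, which is what makes it a finer-grained target than a binary manipulation mask.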