A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization

Jingchun Lian; Lingyu Liu; Yaxiong Wang; Yujiao Wu; Li Zhu; Zhedong Zheng

TL;DR

This work tackles the need for interpretable forgery localization beyond binary masks. It introduces the Multi-Modal Tampering Tracing (MMTT) dataset, a large corpus of 128,303 forged facial image-text triplets built from CelebA-HQ and FFHQ with pixel-level masks and detailed textual annotations. The ForgeryTalker framework then jointly localizes forgeries and generates interpretive reports by combining a Forgery Prompter Network, a Mask Decoder, and a multimodal large language model, trained in two stages. Experiments show strong captioning performance and competitive localization accuracy, underscoring the approach's potential to enhance transparency and reliability in digital forensics.

Abstract

Image forgery localization, which centers on identifying tampered pixels within an image, has seen significant advancements. Traditional approaches often model this challenge as a variant of image segmentation, treating the binary segmentation of forged areas as the end product. We argue that the basic binary forgery mask is inadequate for explaining model predictions: it does not clarify why the model pinpoints certain areas, and it treats all forged pixels the same, making it hard to spot the most fake-looking parts. In this study, we mitigate these limitations by generating salient region-focused interpretations for forged images. To support this, we craft a Multi-Modal Tramper Tracing (MMTT) dataset, comprising facial images manipulated using deepfake techniques and paired with manual, interpretable textual annotations. To harvest high-quality annotations, annotators are instructed to meticulously observe the manipulated images and articulate the typical characteristics of the forgery regions. In total, we collect a dataset of 128,303 image-text pairs. Leveraging the MMTT dataset, we develop ForgeryTalker, an architecture designed for concurrent forgery localization and interpretation. ForgeryTalker first trains a forgery prompter network to identify the pivotal clues within the explanatory text. Subsequently, the region prompter is incorporated into a multimodal large language model for finetuning to achieve the dual goals of localization and interpretation. Extensive experiments conducted on the MMTT dataset verify the superior performance of our proposed model. The dataset, code, and pretrained checkpoints will be made publicly available to facilitate further research and ensure the reproducibility of our results.

Paper Structure (17 sections, 7 equations, 4 figures, 4 tables)

Introduction
Related Work
Interpretation Annotation
Multi-Modal Tramper Tracing dataset
Source Image Collection
Forgery Generation
Dataset Statistics
ForgeryTalker
Architecture
Forgery Prompter Network
Interpretation Generation
Mask Decoder
Experiment
Experimental Setup
Quantitative Results
...and 2 more sections

Figures (4)

Figure 1: The proposed framework combines forgery localization and interpretive analysis. The left panel illustrates dataset construction with Face Swapping and Image Inpainting methods. The right panel defines tasks: forgery localization to identify manipulated regions and forgery interpretation to explain the manipulations, enhancing interpretability.
Figure 2: Annotation pipeline for forgery interpretation. Annotators review the original and forged images ($I_o, I_f$), conduct an Inconsistency Inspection with a Minimum Time Constraint ($\geq$ 1 min), and identify Inconsistent Regions. These regions are used to produce Textual Descriptions within a Maximum Length Constraint ($\leq$ 120 words). Quality Control then screens for false positives (e.g., Ear), ensuring only accurate descriptions are included in the Final Description.
Figure 3: Overview of the MMTT dataset statistics. GAN-FS represents GAN-based Face Swapping, Trans. Inp. denotes Transformer-based Inpainting, and Diff. Inp. refers to Diffusion-based Inpainting. (a) shows the distribution of manipulation methods; (b) shows modified facial part frequency for each inpainting method (excluding GAN-FS, which involves full-face edits); (c) shows the distribution of modified parts per image for Transformer- and Diffusion-based inpainting (excluding GAN-FS due to no localized edits); (d) shows caption length distribution for all methods.
Figure 4: Illustration of our ForgeryTalker. ForgeryTalker extends the InstructBLIP framework by incorporating a Forgery Prompter Network (FPN) and a Mask Decoder. The framework processes an image into patch embeddings via a Vision Transformer. These embeddings are refined by the Q-former, whose features then interact with the FPN's region prompts through cross-attention for forgery localization in the mask decoder. The FPN generates region prompts, which are merged with an instruction template and fed into the Q-former. In the second stage, the FPN is frozen, while the mask decoder and Q-former are jointly optimized for segmentation and language generation. The multimodal features are passed to a large language model to produce a descriptive explanation of the forgery.
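
The annotation constraints named in the Figure 2 caption (inspection time of at least 1 minute, descriptions of at most 120 words, and quality control that screens out false-positive regions) can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the authors' tooling: the `Annotation` record, the `check_annotation` function, whitespace-based word counting, and the `edited_regions` argument are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass

MIN_INSPECTION_SECONDS = 60   # Figure 2: minimum time constraint (>= 1 min)
MAX_DESCRIPTION_WORDS = 120   # Figure 2: maximum length constraint (<= 120 words)


@dataclass
class Annotation:
    """Hypothetical record for one annotator submission."""
    inspection_seconds: float
    regions: list          # regions the annotator flagged as inconsistent
    description: str


def check_annotation(ann, edited_regions):
    """Return a list of constraint violations (empty means the annotation passes).

    `edited_regions` is an assumed stand-in for the regions that were actually
    manipulated; how quality control obtains it is not stated in the caption.
    """
    problems = []
    if ann.inspection_seconds < MIN_INSPECTION_SECONDS:
        problems.append("inspection shorter than 1 minute")
    # Assumption: words are counted by splitting on whitespace.
    if len(ann.description.split()) > MAX_DESCRIPTION_WORDS:
        problems.append("description longer than 120 words")
    false_positives = [r for r in ann.regions if r not in edited_regions]
    if false_positives:
        problems.append("false-positive regions: " + ", ".join(false_positives))
    return problems


if __name__ == "__main__":
    ann = Annotation(75, ["nose", "ear"], "The nose looks blurred and too smooth.")
    print(check_annotation(ann, {"nose", "mouth"}))
    # prints ['false-positive regions: ear']
```

In this sketch the flagged "ear" region is rejected because it is not among the edited regions, mirroring the false-positive example (Ear) in the caption.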
