Progressive Feedback-Enhanced Transformer for Image Forgery Localization

Haochen Zhu, Gang Cao, Xianglin Huang

TL;DR

This work tackles the challenging problem of blind image forgery localization, where subtle tampering cues are easily masked by post-processing. It introduces ProFact, a progressive feedback-enhanced Transformer with two cascaded branches: a coarse localization branch that produces $M_c$ and a feedback enhancement branch that refines it to $M_p$, augmented by a Contextual Spatial Pyramid Module and a Holistic Attention Mechanism to fuse coarse guidance with intermediate features. A realistic data generation strategy (MBH) and a two-stage training protocol further improve generalization to real-world forgeries and AI-generated edits. Empirical results across nine public datasets show that ProFact achieves state-of-the-art generalization and robustness, with strong performance under post-processing and across multiple tampering types. This approach offers practical impact for digital authentication by delivering more reliable, scalable forgery localization in diverse scenarios.

Abstract

Blind detection of forged regions in digital images is an effective means of authentication to counter the malicious use of local image editing techniques. Existing encoder-decoder forensic networks overlook the fact that detecting complex and subtle tampered regions typically requires more feedback information. In this paper, we propose a Progressive FeedbACk-enhanced Transformer (ProFact) network to achieve coarse-to-fine image forgery localization. Specifically, the coarse localization map generated by an initial branch network is adaptively fed back to the early transformer encoder layers, which enhances the representation of positive features while suppressing interference factors. The cascaded transformer network, combined with a contextual spatial pyramid module, is designed to refine discriminative forensic features and thereby improve the accuracy and reliability of forgery localization. Furthermore, we present an effective strategy to automatically generate large-scale forged image samples that are close to real-world forensic scenarios, especially in terms of realistic and coherent processing. Leveraging such samples, a progressive and cost-effective two-stage training protocol is applied to the ProFact network. Extensive experimental results on nine public forensic datasets show that the proposed localizer greatly outperforms the state of the art in the generalization ability and robustness of image forgery localization. Code will be publicly available at https://github.com/multimediaFor/ProFact.
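
The feedback idea in the abstract, where a coarse localization map re-weights earlier features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the nearest-neighbour resizing, and the residual sigmoid weighting `features * (1 + attention)` are all assumptions chosen for illustration, and the actual fusion used by ProFact may differ.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D map to (out_h, out_w)."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]


def feedback_enhance(features, coarse_logits):
    """Re-weight features of shape (C, H, W) with a coarse localization map.

    Assumption (hypothetical, not from the paper): the attention is the
    sigmoid of the coarse logits, applied residually so that no feature is
    zeroed out, only relatively emphasized or de-emphasized.
    """
    _, h, w = features.shape
    attention = sigmoid(resize_nearest(coarse_logits, h, w))
    return features * (1.0 + attention[None, :, :])


# Toy example: a 16x16 coarse map with a confident "forged" square in the middle.
rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))
coarse_logits = np.full((16, 16), -6.0)
coarse_logits[4:12, 4:12] = 6.0
enhanced = feedback_enhance(features, coarse_logits)
```

In this toy setting, features inside the suspected region are scaled by almost 2, while features elsewhere are left nearly unchanged, which is one simple way to realize "enhancing positive features while suppressing interference".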

Paper Structure (26 sections, 13 equations, 8 figures, 9 tables)

This paper contains 26 sections, 13 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Proposed image forgery localization network ProFact. The coarse localization branch (CLB, top) generates a coarse localization map $M_c$ of the input forged image $X$. The feedback enhancement branch (FEB, bottom) predicts the final refined localization map $M_p$ by re-encoding the middle-level features of the CLB with attentive feedback.
  • Figure 2: Effectiveness of the FEB, verified by the visualized feature maps $X_2$, ${{X'}_2}$, ${M_c}$, HAM($M_c$) and ${M_p}$ on example images. Redder regions indicate higher responses. From top to bottom, the forged images are from CASIAv1, NIST16, Coverage and AutoSplice, respectively.
  • Figure 3: Detailed structure of the CSPM. The input feature $M$ passes through the CoT block to explore contextual information and is then enhanced by a spatial pyramid of dilated convolutions to output $M''$. $\oplus$, $\otimes$, $\copyright$ denote element-wise addition, multiplication, and concatenation, respectively.
  • Figure 4: Proposed method for generating realistic training samples. It consists, in order, of digital matting, a processing chain (scaling, rotation, flipping and deformation), alpha blending, and harmonization.
  • Figure 5: Example forged images generated by simple synthesis and MBH methods, and their corresponding ground-truths. Zoom in for details.
  • ...and 3 more figures
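
The alpha blending step named in the Figure 4 caption can be sketched as below. Only the standard compositing formula `alpha * foreground + (1 - alpha) * background` is used; the function names and the 0.5 threshold for deriving a binary ground-truth mask from the matte are illustrative assumptions, and the matting, geometric processing, and harmonization stages are not modelled.

```python
import numpy as np


def alpha_blend(foreground, background, alpha):
    """Composite a foreground over a background with a matte in [0, 1].

    foreground, background: float arrays of shape (H, W, 3)
    alpha: float array of shape (H, W)
    """
    a = alpha[..., None]  # broadcast the matte over the colour channels
    return a * foreground + (1.0 - a) * background


def mask_from_alpha(alpha, threshold=0.5):
    """Binary forgery mask from the matte (the threshold is an assumption)."""
    return (alpha >= threshold).astype(np.uint8)


# Toy example: paste a white 2x2 object with one soft edge pixel onto black.
background = np.zeros((4, 4, 3))
foreground = np.ones((4, 4, 3))
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0
alpha[1, 3] = 0.25  # soft matte edge
forged = alpha_blend(foreground, background, alpha)
mask = mask_from_alpha(alpha)
```

A soft matte lets boundary pixels mix foreground and background, which avoids the hard cut-and-paste edges that simple synthesis tends to leave behind.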