Progressive Feedback-Enhanced Transformer for Image Forgery Localization
Haochen Zhu, Gang Cao, Xianglin Huang
TL;DR
This work tackles blind image forgery localization, a challenging problem because subtle tampering cues are easily masked by post-processing. It introduces ProFact, a progressive feedback-enhanced Transformer with two cascaded branches: a coarse localization branch that produces a coarse map $M_c$, and a feedback enhancement branch that refines it into the final map $M_p$. A Contextual Spatial Pyramid Module and a Holistic Attention Mechanism fuse the coarse guidance with intermediate features. A realistic data generation strategy (MBH) and a two-stage training protocol further improve generalization to real-world forgeries and AI-generated edits. Experiments on nine public datasets show that ProFact achieves state-of-the-art generalization and robustness, with strong performance under post-processing and across multiple tampering types. These results make ProFact a practical option for digital image authentication in diverse scenarios.
Abstract
Blind detection of forged regions in digital images is an effective means of authentication to counter the malicious use of local image editing techniques. Existing encoder-decoder forensic networks overlook the fact that detecting complex and subtle tampered regions typically requires more feedback information. In this paper, we propose a Progressive FeedbACk-enhanced Transformer (ProFact) network to achieve coarse-to-fine image forgery localization. Specifically, the coarse localization map generated by an initial branch network is adaptively fed back to the early transformer encoder layers, which enhances the representation of positive features while suppressing interference factors. The cascaded transformer network, combined with a contextual spatial pyramid module, is designed to refine discriminative forensic features and thereby improve the accuracy and reliability of forgery localization. Furthermore, we present an effective strategy for automatically generating large-scale forged image samples that are close to real-world forensic scenarios, particularly in terms of realistic and coherent processing. Leveraging such samples, a progressive and cost-effective two-stage training protocol is applied to the ProFact network. Extensive experimental results on nine public forensic datasets show that the proposed localizer greatly outperforms state-of-the-art methods in the generalization ability and robustness of image forgery localization. Code will be publicly available at https://github.com/multimediaFor/ProFact.
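To make the feedback idea concrete, the following minimal sketch shows one way a coarse map could reweight early features. It is an illustration under stated assumptions, not the ProFact implementation: the residual gating rule `f * (1 + M_c)`, the function names `coarse_map` and `feedback_enhance`, and the toy inputs are all hypothetical. Under this rule, features at pixels the coarse branch flags as forged are amplified, while the remaining features are left nearly unchanged and are therefore suppressed in relative terms.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def coarse_map(logits):
    """Convert per-pixel logits from a hypothetical coarse branch into M_c in (0, 1)."""
    return [[sigmoid(v) for v in row] for row in logits]


def feedback_enhance(features, m_c):
    """Reweight every channel by (1 + M_c).

    This residual gating rule is an assumption made for illustration;
    the paper only states that the coarse map is adaptively fed back.
    """
    return [
        [[f * (1.0 + g) for f, g in zip(f_row, g_row)]
         for f_row, g_row in zip(channel, m_c)]
        for channel in features
    ]


# Toy 2x2 example: one feature channel, two pixels flagged as likely forged.
logits = [[4.0, -4.0], [-4.0, 4.0]]
features = [[[1.0, 1.0], [1.0, 1.0]]]

m_c = coarse_map(logits)
enhanced = feedback_enhance(features, m_c)
```

In a full model, the enhanced features would pass through the second branch to produce the refined map; the actual method performs this fusion inside the transformer encoder with learned modules rather than a fixed multiplicative rule.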
