A Lightweight and Effective Image Tampering Localization Network with Vision Mamba
Kun Guo, Gang Cao, Zijie Lou, Xianglin Huang, Jiaoyun Liu
TL;DR
This work tackles blind image tampering localization by proposing ForMa, a lightweight network built on Vision Mamba that models long-range dependencies with linear complexity. The approach combines three components: a Visual State Space (VSS) encoder based on 2D selective scanning (SS2D), a noise-assisted decoding strategy, and a lightweight decoder that upsamples with a parameter-free pixel shuffle layer. Across 10 cross-domain benchmarks, ForMa attains state-of-the-art average scores (F1 64.1%, IoU 56.2%) while requiring only 37M parameters and 42G FLOPs, outperforming both CNN and Transformer baselines. The method also generalizes well and is robust to post-processing. Code is available at the authors' repository.
Abstract
Current image tampering localization methods primarily rely on Convolutional Neural Networks (CNNs) and Transformers. CNNs are constrained by limited local receptive fields, while Transformers offer global context modeling at the expense of quadratic computational complexity. Recently, the state space model Mamba has emerged as a competitive alternative, enabling global dependency modeling with linear complexity. Inspired by it, we propose a lightweight and effective FORensic network based on vision MAmba (ForMa) for blind image tampering localization. First, ForMa extracts multi-scale global features, modeling global dependencies efficiently with linear complexity. Then, a lightweight decoder generates the pixel-wise localization map, using a parameter-free pixel shuffle layer for upsampling. Additionally, a noise-assisted decoding strategy is proposed to integrate complementary manipulation traces from tampered images, boosting the decoder's sensitivity to forgery cues. Experimental results on 10 standard datasets demonstrate that ForMa achieves state-of-the-art generalization ability and robustness while maintaining the lowest computational complexity. Code is available at https://github.com/multimediaFor/ForMa.
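To illustrate why a pixel shuffle layer needs no learned parameters, the following is a minimal sketch of the generic operation in plain Python. It is not taken from the ForMa code base: the function name `pixel_shuffle`, the nested-list tensor representation, and the toy input are assumptions made for illustration, and the channel ordering follows the common channel-to-space convention rather than anything confirmed by the paper. The operation only rearranges values, turning a feature map of shape (C·r², H, W) into one of shape (C, H·r, W·r).

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Illustrative only: no weights are involved, values are just moved
    from the channel axis into an r x r spatial neighbourhood.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])

    # Allocate the upsampled output, filled with zeros.
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]

    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Toy input (hypothetical): 4 channels of size 2x2, upscale factor 2.
features = [[[k * 4 + y * 2 + z for z in range(2)] for y in range(2)]
            for k in range(4)]
upsampled = pixel_shuffle(features, 2)  # 1 channel of size 4x4
```

Because the upsampling is a pure re-indexing, it adds nothing to the parameter count, which is consistent with the lightweight decoder described in the abstract.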
