A Lightweight and Effective Image Tampering Localization Network with Vision Mamba

Kun Guo, Gang Cao, Zijie Lou, Xianglin Huang, Jiaoyun Liu

TL;DR

This work tackles blind image tampering localization by proposing ForMa, a lightweight network powered by Vision Mamba that models long-range dependencies with linear complexity. The approach combines a Visual State Space (VSS) encoder with SS2D, a noise-assisted decoding strategy, and a parameter-free pixel shuffle decoder to achieve accurate tampering localization with low computational cost. On 10 cross-domain benchmarks, ForMa attains state-of-the-art averages (F1 64.1%, IoU 56.2%) while requiring only 37M parameters and 42G FLOPs, outperforming both CNN and Transformer baselines. The method demonstrates strong generalization and robustness to post-processing, with code available at the authors' repository for practical deployment.
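
The F1 and IoU figures quoted above are pixel-level overlap scores between a predicted tampering mask and the ground truth. The sketch below shows the standard definitions on binary masks. The function name `f1_and_iou`, the nested-list mask format, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration; the authors' exact evaluation protocol (thresholding, per-image versus per-dataset averaging) is not stated in this summary.

```python
def f1_and_iou(pred, truth):
    """Pixel-level F1 and IoU for two binary masks given as nested lists of 0/1.

    Illustrative helper, not the authors' evaluation code.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # tampered pixel correctly flagged
            elif p and not t:
                fp += 1      # pristine pixel wrongly flagged
            elif t and not p:
                fn += 1      # tampered pixel missed
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumption: score 0.0 when both masks are empty (denominator is zero).
    f1 = 2 * tp / f1_den if f1_den else 0.0
    iou = tp / iou_den if iou_den else 0.0
    return f1, iou


pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [1, 0]]
f1, iou = f1_and_iou(pred, truth)  # one true positive, one false positive, one miss
```

Because IoU counts each true positive once where F1 counts it twice, IoU is never larger than F1 for the same pair of masks, which is consistent with the reported averages (F1 64.1%, IoU 56.2%).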

Abstract

Current image tampering localization methods primarily rely on Convolutional Neural Networks (CNNs) and Transformers. While CNNs suffer from limited local receptive fields, Transformers offer global context modeling at the expense of quadratic computational complexity. Recently, the state space model Mamba has emerged as a competitive alternative, enabling linear-complexity global dependency modeling. Inspired by it, we propose a lightweight and effective FORensic network based on vision MAmba (ForMa) for blind image tampering localization. First, ForMa captures multi-scale global features, modeling global dependencies efficiently with linear complexity. Then the pixel-wise localization map is generated by a lightweight decoder, which employs a parameter-free pixel shuffle layer for upsampling. Additionally, a noise-assisted decoding strategy is proposed to integrate complementary manipulation traces from tampered images, boosting decoder sensitivity to forgery cues. Experimental results on 10 standard datasets demonstrate that ForMa achieves state-of-the-art generalization ability and robustness, while maintaining the lowest computational complexity. Code is available at https://github.com/multimediaFor/ForMa.
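
To make "parameter-free pixel shuffle" concrete, here is a minimal framework-free sketch of the operation: it rearranges a feature map of shape (C·r², H, W) into (C, H·r, W·r) by moving channel entries into spatial positions, with no learned weights. The function name `pixel_shuffle` and the nested-list tensor layout are assumptions for illustration; this follows the channel ordering used by `torch.nn.PixelShuffle` and is not taken from the authors' repository.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Pure data movement: no weights, so the layer adds no parameters.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical offset inside each r x r cell
            for j in range(r):      # horizontal offset inside each r x r cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become a single 2x2 channel.
features = [[[1]], [[2]], [[3]], [[4]]]
upsampled = pixel_shuffle(features, 2)
```

In a decoder of this kind, any learnable capacity sits in the layers around the shuffle (for example the linear and convolution layers), while the upsampling step itself only reorders values.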

Paper Structure

This paper contains 11 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of average F1 across 10 standard benchmark datasets, FLOPs (computed at a 512 × 512 input size), and model parameters. Our method achieves the best F1 (64.1%) with the lowest FLOPs (42G) and parameters (37M).
  • Figure 2: (a) Architecture of the proposed ForMa. $L_i$={2, 2, 9, 2}. Linear, Conv, and Shuffle refer to the linear, convolution, and pixel shuffle layers, respectively. $\oplus$ represents element-wise addition. (b) Structure of the VSS Block. It includes a depthwise convolutional layer (DWConv), SiLU activation function, SS2D module, and layer normalization (LN). The VSS Block, Shuffle-based Decoder, and Noise Extractor are learnable.
  • Figure 3: Illustration of the SS2D module. The S6 used is from Mamba [gu2023mamba].
  • Figure 4: Results on example images from NIST, Korus and CASIAv1 datasets. From left to right: tampered images, ground truth, localization results from CAT-Net (the best CNN-based method), TruFor (the best Transformer-based method) and ForMa.
  • Figure 5: Robustness evaluation against different image post-processing techniques on Columbia dataset.
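
The SS2D module of Figure 3 can be illustrated with a toy sketch of the four-direction scanning idea used in Vision Mamba-style models: flatten the 2D feature map along several traversal orders, run a one-pass recurrence over each sequence, and merge the results back onto the grid. Everything below is an assumption made for illustration: the function names are hypothetical, the fixed scalar recurrence s_t = a·s_{t-1} + b·x_t stands in for the input-dependent S6 block, and summation is used as the merge step. It is not the authors' implementation.

```python
def scan_orders(h, w):
    """Four traversal orders over an h x w grid, each a list of (row, col)."""
    row_major = [(r, c) for r in range(h) for c in range(w)]
    col_major = [(r, c) for c in range(w) for r in range(h)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def linear_scan(seq, a=0.5, b=1.0):
    """Toy recurrence s_t = a*s_{t-1} + b*x_t; a single pass, O(len(seq))."""
    state, out = 0.0, []
    for x in seq:
        state = a * state + b * x
        out.append(state)
    return out


def ss2d_sketch(grid, a=0.5, b=1.0):
    """Scan the grid in four directions and sum the outputs per position."""
    h, w = len(grid), len(grid[0])
    merged = [[0.0] * w for _ in range(h)]
    for order in scan_orders(h, w):
        seq = [grid[r][c] for r, c in order]
        for (r, c), y in zip(order, linear_scan(seq, a, b)):
            merged[r][c] += y
    return merged


# A single active pixel at the top-left corner influences every position.
grid = [[1.0, 0.0],
        [0.0, 0.0]]
result = ss2d_sketch(grid)
```

Each of the four scans visits every pixel once, so the total cost grows linearly with the number of pixels, whereas pairwise self-attention over the same positions grows quadratically. The nonzero values at all four output positions show how a signal at one pixel reaches the whole map.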