Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization through Spare-Coding Transformer

Lei Su, Xiaochen Ma, Xuekang Zhu, Chaoqun Niu, Zeyu Lei, Ji-Zhe Zhou

TL;DR

This work addresses the limitations of handcrafted non-semantic feature extractors in Image Manipulation Localization (IML) by introducing SparseViT, a Vision Transformer that replaces dense global self-attention with sparse self-attention to focus on manipulation-sensitive, non-semantic cues. A sparsity rate $\mathcal{S}$ partitions each feature map into blocks, and attention is computed only within a block, which reduces FLOPs and suppresses semantic encoding. A multi-scale sparsity strategy with explicit schedules $S3_{\mathcal{S}}^{b_i}$ and $S4_{\mathcal{S}}^{b_i}$ supports non-semantic feature extraction across scales, complemented by a Learnable Feature Fusion (LFF) head that adaptively fuses multi-scale features via learnable weights $\gamma$. Across multiple public IML benchmarks, SparseViT achieves state-of-the-art pixel-level F1 and AUC while reducing FLOPs by up to 80% relative to existing IML models and showing strong cross-dataset generalization, underscoring the potential of adaptive non-semantic feature learning for robust and scalable IML.

Abstract

Non-semantic features, or semantic-agnostic features, which are irrelevant to image context but sensitive to image manipulations, are recognized as evidential to Image Manipulation Localization (IML). Since manual labels are impossible to obtain, existing works rely on handcrafted methods to extract non-semantic features. Handcrafted non-semantic features jeopardize an IML model's generalization ability in unseen or complex scenarios. Therefore, for IML, the elephant in the room is: how can non-semantic features be extracted adaptively? Non-semantic features are context-irrelevant and manipulation-sensitive. That is, within an image, they are consistent across patches unless manipulation occurs. Hence, sparse and discrete interactions among image patches are sufficient for extracting non-semantic features. In contrast, image semantics vary drastically across patches, requiring dense and continuous interactions among image patches for learning semantic representations. In this paper, we therefore propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete form. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features from images. Moreover, compared with existing IML models, the sparse self-attention mechanism substantially reduces the model size (by up to 80% in FLOPs), achieving high parameter efficiency and reduced computation. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets.
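The sparse self-attention described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that a sparsity rate `s` splits an `(H, W, C)` feature map into `s * s` groups of tokens sharing the same `(row % s, col % s)` offset, and it uses identity query/key/value projections for brevity. The function names `sparse_group`, `sparse_ungroup`, `sparse_self_attention`, and `attention_pairs` are hypothetical.

```python
import numpy as np


def sparse_group(x, s):
    """Split an (H, W, C) feature map into s*s groups of tokens.

    Assumption: tokens that share the same (row % s, col % s) offset
    form one group, so each group is a strided sample of the map.
    """
    h, w, c = x.shape
    assert h % s == 0 and w % s == 0, "H and W must be divisible by s"
    # (H/s, s, W/s, s, C) -> (s, s, H/s, W/s, C) -> (s*s, H/s * W/s, C)
    g = x.reshape(h // s, s, w // s, s, c).transpose(1, 3, 0, 2, 4)
    return g.reshape(s * s, (h // s) * (w // s), c)


def sparse_ungroup(g, h, w, s):
    """Inverse of sparse_group: restore the (H, W, C) layout."""
    c = g.shape[-1]
    x = g.reshape(s, s, h // s, w // s, c).transpose(2, 0, 3, 1, 4)
    return x.reshape(h, w, c)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sparse_self_attention(x, s):
    """Self-attention restricted to each group.

    Assumption: Q, K and V are the input tokens themselves (no learned
    projections), which keeps the sketch short.
    """
    h, w, c = x.shape
    g = sparse_group(x, s)                          # (s*s, N/s^2, C)
    scores = g @ g.transpose(0, 2, 1) / np.sqrt(c)  # per-group score matrices
    out = softmax(scores) @ g
    return sparse_ungroup(out, h, w, s)


def attention_pairs(h, w, s):
    """Query-key pairs scored: s^2 groups, each with (HW / s^2)^2 pairs."""
    n = (h // s) * (w // s)
    return s * s * n * n
```

Under these assumptions, an 8×8 map with `s = 2` scores 1024 query-key pairs instead of the 4096 needed by dense attention, a reduction by a factor of `s * s`. This counts attention scores only and is not a reproduction of the FLOPs figures reported in the paper.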

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: SparseViT. SparseViT consists of two key components: an encoder with a sparse self-attention mechanism and a prediction head (LFF) for multi-scale feature fusion. More detailed information about each module will be presented in Method.
  • Figure 2: Sparse Self-Attention. A diagram illustrating the calculation of sparse attention. The self-attention computation occurs only between image patches of the same color.
  • Figure 3: The Structure of LFF. By introducing learnable parameters $\gamma$, LFF dynamically adjusts the contribution of each feature map channel to the fusion result.
  • Figure 4: We select an anchor point in the manipulation region and observe how other labels contribute to its attention. After sparsification, the anchor point’s attention focuses more on the manipulation-related edge regions containing non-semantic information, rather than on the surrounding semantic regions.
  • Figure 5: IML by the SoTA. Existing models exhibit noticeable semantic-related false positives in the last three rows. Our model, SparseViT, effectively ignores semantic-related distractions through its unique sparse self-attention mechanism, focusing on capturing features that are unrelated to the semantic content but crucial to the integrity of the image.
  • ...and 3 more figures
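
The Figure 3 caption describes LFF as weighting each feature-map channel with learnable parameters $\gamma$ before fusion. The sketch below shows one way such a fusion could look. It is not the authors' head. It assumes nearest-neighbour upsampling to the largest resolution, a $\gamma$-weighted sum as the fusion rule, and fixed rather than trained weights. The names `upsample_nearest` and `lff_fuse` are hypothetical.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map by an integer factor."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def lff_fuse(features, gammas):
    """Fuse multi-scale (H_i, W_i, C) maps with per-channel weights.

    Assumptions: all maps are square-scaled versions of each other with the
    same C channels, the largest height is an integer multiple of every
    other height, and fusion is a gamma-weighted sum. In a trained model
    the gammas would be learnable parameters; here they are fixed arrays.
    """
    target_h = max(f.shape[0] for f in features)
    fused = 0.0
    for f, gamma in zip(features, gammas):
        up = upsample_nearest(f, target_h // f.shape[0])
        fused = fused + gamma * up  # gamma has shape (C,), broadcast over H and W
    return fused
```

Each `gamma` holds one weight per channel, so a channel can be emphasized at one scale and suppressed at another.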