Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement

Xinghao Wang, Tao Gong, Qi Chu, Bin Liu, Nenghai Yu

TL;DR

This work tackles weakly supervised image manipulation localization by learning from image-level labels alone. It introduces Context-Aware Boundary Localization (CABL) to capture boundary context via a Sobel-based edge emphasis, and CAM-Guided SAM Refinement (CGSR) to convert coarse CAMs into precise masks using SAM with informative prompts, all within a dual-branch Transformer-CNN backbone. The model is trained with a simple joint loss $loss = loss_{CABL} + loss_{Trans}$ and achieves state-of-the-art performance on several datasets for both detection and pixel-level localization, while remaining robust to common degradations. The approach reduces annotation burden while delivering high-fidelity localization, with practical implications for defending against manipulated imagery in real-world settings.
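
As a rough illustration of the Sobel-based edge emphasis mentioned above, the sketch below computes a Sobel gradient magnitude with NumPy. It is not the CABL module itself, whose internals are not described in this summary; the helper names (`filter3x3`, `sobel_magnitude`) and the edge-replicated padding are assumptions made for the example.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Cross-correlate a 2-D image with a 3x3 kernel (edge-replicated padding)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the padded image, weighted by one kernel tap.
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_magnitude(image):
    """Return the per-pixel gradient magnitude sqrt(gx^2 + gy^2)."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


# Toy image (assumed data): dark left half, bright right half.
toy = np.zeros((6, 6))
toy[:, 3:] = 1.0
edges = sobel_magnitude(toy)
```

On this toy image the response is nonzero only in the two columns adjacent to the vertical step, which is the kind of boundary signal an edge-aware module could then aggregate.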

Abstract

Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent progress in image manipulation detection has largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that require only image-level binary labels for training. However, existing weakly supervised methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and the Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating the two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The framework of our method. We integrate our custom-designed CABL module and CGSR module into the backbone. We train the model using only images and image-level binary labels. Sobel denotes the Sobel operator, Stem generates patch embeddings, and ConvBlock and TransBlock are composed of convolution layers and transformer layers, respectively. The green point in the Prompt Generator indicates the positive point prompt, the red point represents the negative point prompt, and the red box denotes the bounding box prompt.
  • Figure 2: Feature maps shown to the right of the vertical line: the first row (w/o CABL) shows the feature maps of the first 4 blocks; the second row (w/ CABL) shows the enhanced maps. Arrows point toward deeper blocks.
  • Figure 3: Three candidate structures originally designed for CABL; structure III was ultimately adopted as CABL.
  • Figure 4: Qualitative results on four datasets. From top to bottom: Image, GT, WSCL+MIL-FCN, Ours w/o and w/ SAM refinement.
  • Figure 5: Robustness evaluation under JPEG compression and Gaussian blur on the CASIAv1 dataset.
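
The Prompt Generator in Figure 1 produces a positive point, a negative point, and a bounding box from a CAM. One plausible way to derive such prompts is sketched below. The function name `cam_to_prompts`, the 0.5 threshold, and the choice of the CAM maximum and minimum as the positive and negative points are all assumptions for illustration, not the paper's stated procedure. The call to SAM itself is omitted.

```python
import numpy as np


def cam_to_prompts(cam, threshold=0.5):
    """Derive point and box prompts from a coarse 2-D CAM (hypothetical helper).

    Returns None when no pixel reaches the threshold after min-max scaling.
    Coordinates are (x, y); the box is (x_min, y_min, x_max, y_max).
    """
    cam = np.asarray(cam, dtype=np.float64)
    span = cam.max() - cam.min()
    # Min-max normalize to [0, 1]; a flat CAM carries no localization signal.
    norm = (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)
    mask = norm >= threshold
    if not mask.any():
        return None

    # Tight bounding box around all activated pixels.
    ys, xs = np.nonzero(mask)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # Assumed rule: strongest activation is foreground, weakest is background.
    py, px = np.unravel_index(np.argmax(norm), norm.shape)
    ny, nx = np.unravel_index(np.argmin(norm), norm.shape)
    return {
        "positive_point": (int(px), int(py)),
        "negative_point": (int(nx), int(ny)),
        "box": box,
    }


# Toy CAM (assumed data): one activated rectangle with a single peak.
cam = np.zeros((8, 8))
cam[2:5, 3:7] = 0.8
cam[3, 4] = 1.0
prompts = cam_to_prompts(cam)
```

In a full pipeline, these coordinates would be passed to SAM as point and box prompts, and the returned mask would replace the coarse CAM as the localization map.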