Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization

Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu

TL;DR

This work tackles image manipulation localization without dense pixel-level annotations by introducing ICFC, a training-free framework that combines Rule Decomposition and Filtering (RDF) with Objectified Rule Sets and Multi-step Progressive Reasoning (MPR) to guide multi-modal language models. RDF converts vague forensic cues into interpretable rules, filtered by CLIP to provide relevant priors, while MPR mirrors expert workflows to produce coarse bounding boxes refined through iterative reasoning and SAM-based pixel-level segmentation, along with human-readable explanations. The approach yields image-level judgments, fine-grained localization, and interpretable forensic rationales, achieving state-of-the-art performance among training-free methods and competitive results with weakly and fully supervised systems across six benchmarks. The findings highlight the potential of knowledge-guided, training-free paradigms for scalable, interpretable image forensics in practical security contexts.
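
The rule-filtering idea can be illustrated with a minimal sketch. It assumes the image and each rule have already been embedded by a CLIP-style encoder, and it keeps the rules whose embeddings are most similar to the image. The toy embeddings, the rule texts, the `filter_rules` helper, and the top-k selection criterion are all illustrative assumptions rather than details taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_rules(
    image_emb: Sequence[float],
    rule_embs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k rules ranked by similarity to the image embedding.

    Hypothetical helper: top-k selection is an assumption made for this sketch.
    """
    scored = [(rule, cosine(image_emb, emb)) for rule, emb in rule_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-d embeddings standing in for real encoder outputs (assumed values).
image_emb = [1.0, 0.0, 0.0]
rule_embs = {
    "inconsistent shadow direction": [0.9, 0.1, 0.0],
    "blurred object boundary": [0.0, 1.0, 0.0],
    "duplicated texture patch": [0.5, 0.5, 0.0],
}

selected = filter_rules(image_emb, rule_embs, top_k=2)
for rule, score in selected:
    print(f"{score:.3f}  {rule}")
```

In a real pipeline the embeddings would come from an actual image-text encoder, and the selected rules would be placed in the MLLM prompt as priors.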

Abstract

Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates objectified rule construction with adaptive filtering to build a reliable knowledge base, and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows, moving from coarse proposals to fine-grained forensic results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.

Paper Structure

This paper contains 9 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Manipulation localization comparison. From top to bottom: tampered images, Mantra-Net (fully supervised), WSCL (weakly supervised), NOI1 and DiffusionIML (training-free), our method (training-free), and ground truth. Existing methods struggle with generalization and semantic-level manipulations, while our approach accurately delineates manipulation boundaries.
  • Figure 2: Overall workflow of the proposed framework. Seed rules are refined into an Objectified Rule Set (ORS) by LLMs and experts, then filtered with CLIP for relevance. Guided by these rules, the MLLM generates reasoning messages with bounding boxes, uses the Crop tool to extract regions (Vision), and employs these as inputs for subsequent reasoning steps. Red messages denote initial proposals of potentially tampered areas. Green messages indicate the selected most suspicious region. Blue messages represent the final refinement, whose bounding box is passed to SAM for pixel-level segmentation. The pipeline outputs image-level labels, pixel-level localization, and human-interpretable forensic explanations. Purple dashed lines depict the reasoning trajectory of the MLLM throughout the pipeline.
  • Figure 3: Ablation study on the effect of the number of reasoning steps in MPR ($n$ in Eq. 2) on CASIAv1, measured by P-AUC. Performance improves noticeably up to $n=2$ and then remains stable with further steps.
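
The coarse-to-fine loop described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the MLLM is replaced by a hypothetical `propose` callable that returns a box relative to the current crop, the `refine` and `to_global` helpers are names invented here, and the final SAM segmentation step is left out. The default of two steps mirrors the Figure 3 observation that gains level off at $n=2$.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def to_global(crop: Box, local: Box) -> Box:
    """Map a box given relative to a crop back to full-image coordinates,
    clamping it so it never leaves the crop it was proposed in."""
    cx0, cy0, cx1, cy1 = crop
    lx0, ly0, lx1, ly1 = local
    gx0 = min(max(cx0 + lx0, cx0), cx1)
    gy0 = min(max(cy0 + ly0, cy0), cy1)
    gx1 = min(max(cx0 + lx1, cx0), cx1)
    gy1 = min(max(cy0 + ly1, cy0), cy1)
    return (gx0, gy0, gx1, gy1)


def refine(
    image_size: Tuple[int, int],
    propose: Callable[[Box, int], Optional[Box]],
    n_steps: int = 2,
) -> Box:
    """Iteratively shrink the suspicious region.

    `propose(crop, step)` stands in for an MLLM call on the cropped region
    (hypothetical interface). It returns a crop-relative box, or None when
    nothing suspicious is found, which stops the refinement early.
    """
    width, height = image_size
    current: Box = (0, 0, width, height)  # start from the whole image
    for step in range(n_steps):
        local = propose(current, step)
        if local is None:
            break
        current = to_global(current, local)
    return current


def toy_propose(crop: Box, step: int) -> Optional[Box]:
    """Toy stand-in for the model: always picks the central half of the crop."""
    w = crop[2] - crop[0]
    h = crop[3] - crop[1]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)


final_box = refine((400, 400), toy_propose, n_steps=2)
print(final_box)
```

In the described pipeline, the final box would then be handed to a segmentation model as a box prompt to obtain the pixel-level mask.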