From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Zhiqing Guo, Dongdong Xi, Songlin Li, Gaobo Yang

TL;DR

The paper tackles the challenge of achieving fine-grained image manipulation localization without expensive pixel-level annotations. It introduces BoxPromptIML, a coarse-to-fine weakly supervised framework that uses coarse bounding boxes as prompts and a frozen SAM to generate high-quality pseudo masks, which guide a lightweight student via knowledge distillation. A Memory-Guided Gated Fusion Module stores prototypical tampering patterns and fuses multi-scale features under dual guidance from real-time context and memory priors, boosting localization robustness. Experiments on in-distribution and out-of-distribution data show competitive performance with fully supervised methods while requiring far less annotation and enabling efficient deployment, including resilience to social-media compression. This approach provides a scalable, practical solution for forensics and media integrity tasks where annotation budgets are constrained and generalization to new domains is critical.
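
The coarse-to-fine supervision described above can be illustrated with a toy sketch. It is not the authors' implementation: the frozen SAM teacher is replaced by a hand-written soft mask, and the helper names (`box_to_mask`, `clip_to_box`, `soft_bce`), the clipping of teacher responses to the box, and the choice of binary cross-entropy as the distillation loss are all assumptions made for illustration.

```python
import math

def box_to_mask(box, height, width):
    """Rasterize a coarse box (x0, y0, x1, y1), end-exclusive, into a flat 0/1 mask."""
    x0, y0, x1, y1 = box
    return [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
            for y in range(height) for x in range(width)]

def clip_to_box(pseudo_mask, box_mask):
    """Assumption: discard teacher responses that fall outside the annotated box."""
    return [m * b for m, b in zip(pseudo_mask, box_mask)]

def soft_bce(student_probs, teacher_probs, eps=1e-7):
    """Mean binary cross-entropy of student probabilities against soft teacher targets."""
    total = 0.0
    for p, t in zip(student_probs, teacher_probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(student_probs)

# 4x4 toy image; the coarse box covers rows 1-2 and columns 1-2.
box_mask = box_to_mask((1, 1, 3, 3), 4, 4)

# Hypothetical teacher output standing in for a frozen SAM prompted with the box.
teacher_raw = [0.0, 0.2, 0.0, 0.0,
               0.0, 0.9, 0.8, 0.0,
               0.0, 0.7, 0.1, 0.0,
               0.0, 0.0, 0.0, 0.0]
target = clip_to_box(teacher_raw, box_mask)

# Two hypothetical student predictions: one close to the pseudo mask, one uninformative.
good_student = [min(max(t, 0.05), 0.95) for t in target]
bad_student = [0.5] * 16

loss_good = soft_bce(good_student, target)
loss_bad = soft_bce(bad_student, target)
print(loss_good < loss_bad)
```

The student never sees a pixel-level ground-truth mask in this setup; the only human input is the box, and the soft target comes from the teacher.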

Abstract

Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment. In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
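
One way to picture the dual-guidance idea is the sketch below. It is a simplified illustration under stated assumptions, not the paper's module: the memory is a fixed list of prototype vectors rather than a learned bank, each scale is reduced to a single feature vector, recall is plain dot-product attention, and the gate formula (a sigmoid over the sum of context and memory similarities) is invented for this example. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def recall(memory, context):
    """Attend over stored prototypes using the current context as the query."""
    scale = math.sqrt(len(context))
    weights = softmax([dot(p, context) / scale for p in memory])
    recalled = [sum(w * p[i] for w, p in zip(weights, memory))
                for i in range(len(context))]
    return recalled, weights

def gated_fusion(scale_feats, memory):
    """Fuse per-scale features with gates driven by context and recalled memory."""
    dim = len(scale_feats[0])
    # Real-time cue: the mean of the current multi-scale features.
    context = [sum(f[i] for f in scale_feats) / len(scale_feats) for i in range(dim)]
    # Memory cue: a prototype mixture recalled for this specific context.
    recalled, weights = recall(memory, context)
    # Dual guidance (assumed form): each gate sees both cues.
    gates = [1.0 / (1.0 + math.exp(-(dot(f, context) + dot(f, recalled))))
             for f in scale_feats]
    total = sum(gates)
    fused = [sum(g * f[i] for g, f in zip(gates, scale_feats)) / total
             for i in range(dim)]
    return fused, gates, weights

# Hypothetical prototypes and three per-scale feature vectors.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scale_feats = [[2.0, 0.0, 0.0], [1.5, 0.5, 0.0], [-1.0, 0.0, 1.0]]

fused, gates, weights = gated_fusion(scale_feats, memory)
print(gates[0] > gates[2], weights[0] > weights[1])
```

In this toy setting the scale whose features agree with both the current context and the recalled prototype receives the largest gate, while the disagreeing scale is down-weighted.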

Paper Structure

This paper contains 21 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of annotation cost and supervision quality across IML paradigms. Pixel-level masks take 23 minutes per image, whereas image-level labels require 4 seconds. Our coarse-to-fine IML framework adopts rough boxes, which take only 7 seconds to annotate while retaining spatial clues.
  • Figure 2: Overview of the proposed framework for fine-grained manipulation localization using coarse prompts. A frozen SAM generates pseudo-masks from coarse annotations (e.g., bounding boxes) as soft supervision. The student model learns to replicate these masks via knowledge distillation. To improve localization, we design a Memory-Guided Gated Fusion Module (MGFM) that fuses multi-scale features with guidance from both real-time and memory-recalled signals. This enables refined mask prediction using only coarse supervision.
  • Figure 3: Qualitative comparison of manipulation localization results on both IND (top two rows) and OOD (bottom three rows) examples.
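
The per-image annotation times quoted in the Figure 1 caption can be turned into a rough budget estimate. The snippet below is only a back-of-the-envelope calculation: the three timings come from the caption, while the dataset size of 10,000 images and the function names are hypothetical.

```python
# Per-image annotation time in seconds, taken from the Figure 1 caption.
SECONDS_PER_IMAGE = {
    "pixel_mask": 23 * 60,  # 23 minutes
    "coarse_box": 7,
    "image_label": 4,
}

def annotation_hours(num_images, paradigm):
    """Total annotation time in hours for a dataset of num_images."""
    return num_images * SECONDS_PER_IMAGE[paradigm] / 3600

def speedup(slow, fast):
    """How many times faster the `fast` paradigm is than the `slow` one."""
    return SECONDS_PER_IMAGE[slow] / SECONDS_PER_IMAGE[fast]

num_images = 10_000  # hypothetical dataset size
for paradigm in SECONDS_PER_IMAGE:
    print(f"{paradigm}: {annotation_hours(num_images, paradigm):.2f} h")

print(f"box vs mask: {speedup('pixel_mask', 'coarse_box'):.1f}x faster")
print(f"box vs label: {speedup('coarse_box', 'image_label'):.2f}x slower")
```

Under these figures a coarse box is roughly 197 times cheaper to draw than a pixel-level mask and costs 1.75 times as much as an image-level label, which is the trade-off the framework is built around.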