From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Zhiqing Guo; Dongdong Xi; Songlin Li; Gaobo Yang

TL;DR

The paper tackles the challenge of achieving fine-grained image manipulation localization without expensive pixel-level annotations. It introduces BoxPromptIML, a coarse-to-fine weakly supervised framework that uses coarse bounding boxes as prompts and a frozen SAM to generate high-quality pseudo masks, which guide a lightweight student via knowledge distillation. A Memory-Guided Gated Fusion Module stores prototypical tampering patterns and fuses multi-scale features under dual guidance from real-time context and memory priors, boosting localization robustness. Experiments on in-distribution and out-of-distribution data show competitive performance with fully supervised methods while requiring far less annotation and enabling efficient deployment, including resilience to social-media compression. This approach provides a scalable, practical solution for forensics and media integrity tasks where annotation budgets are constrained and generalization to new domains is critical.
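The memory-guided fusion idea summarized above can be illustrated with a small NumPy sketch. It is not the authors' module: the attention-style recall over a prototype bank, the softmax gate over scales, and the names `memory_gated_fusion`, `memory`, and `w_gate` are all assumptions made for illustration. It shows only the general pattern of fusing multi-scale features with weights computed jointly from the current context and recalled prototypes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_gated_fusion(features, memory, w_gate):
    """Fuse multi-scale features under context + memory guidance (illustrative).

    features: list of S arrays of shape (N, C), one per scale, assumed to be
              already resized to the same N spatial positions.
    memory:   (K, C) bank of prototype vectors (hypothetical; learned in practice).
    w_gate:   (2C, S) hypothetical projection producing one gate logit per scale.
    """
    stacked = np.stack(features, axis=0)            # (S, N, C)
    context = stacked.mean(axis=0)                  # (N, C) real-time context

    # Recall: each position attends over the stored prototypes.
    scale = np.sqrt(memory.shape[1])
    attn = softmax(context @ memory.T / scale)      # (N, K)
    recalled = attn @ memory                        # (N, C) memory prior

    # Dual guidance: gate logits depend on both context and recalled memory.
    gate_in = np.concatenate([context, recalled], axis=-1)   # (N, 2C)
    gates = softmax(gate_in @ w_gate)               # (N, S), rows sum to 1

    # Per-position weighted sum over scales.
    fused = np.einsum("ns,snc->nc", gates, stacked)  # (N, C)
    return fused, gates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = [rng.normal(size=(6, 4)) for _ in range(3)]
    fused, gates = memory_gated_fusion(
        feats, rng.normal(size=(5, 4)), rng.normal(size=(8, 3))
    )
    print(fused.shape, gates.shape)
```

Because the gate here is a softmax over scales, each fused feature is a convex combination of the per-scale features; a sigmoid gate per scale would be an equally plausible choice.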

Abstract

Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world deployment. In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we introduce a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
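The distillation step described in the abstract can be sketched as a loss between the student's predicted mask and a teacher-provided pseudo mask. The abstract does not specify the training objective, so the binary cross-entropy plus Dice combination below is an assumption, as are the function names `box_to_mask` and `distillation_loss`. In the actual framework the pseudo mask would come from the frozen SAM-based teacher prompted with the coarse box; here a rasterized box stands in for it so that the example runs without any model.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Rasterize a coarse box (x0, y0, x1, y1), end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def distillation_loss(student_logits, pseudo_mask, eps=1e-6):
    """BCE + Dice between student predictions and a pseudo mask (assumed objective)."""
    prob = 1.0 / (1.0 + np.exp(-student_logits))
    prob = np.clip(prob, eps, 1.0 - eps)

    # Pixel-wise binary cross-entropy against the pseudo labels.
    bce = -np.mean(
        pseudo_mask * np.log(prob) + (1.0 - pseudo_mask) * np.log(1.0 - prob)
    )

    # Soft Dice loss, which is less sensitive to the small size of tampered regions.
    inter = np.sum(prob * pseudo_mask)
    dice = 1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(pseudo_mask) + eps)
    return bce + dice


if __name__ == "__main__":
    # Stand-in for a teacher pseudo mask: in practice this would be the mask
    # produced by the frozen teacher for the box prompt, not the box itself.
    pseudo = box_to_mask((2, 2, 6, 6), height=8, width=8)

    confident_right = (pseudo * 2.0 - 1.0) * 10.0   # logits agreeing with the mask
    confident_wrong = -confident_right              # logits contradicting it
    print(distillation_loss(confident_right, pseudo))
    print(distillation_loss(confident_wrong, pseudo))
```

A student that agrees with the pseudo mask receives a loss close to zero, while one that contradicts it is penalized by both terms.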
