DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline

Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng

TL;DR

This work tackles the localization of diffusion-based image edits by introducing DEAL-300K, a large-scale dataset annotated via an automated pipeline combining multimodal instruction generation, mask-free editing, and active-learning change detection. It then presents MFPT, a framework that freezes Visual Foundation Models and augments them with Frequency Input Prompters and Feature Frequency Prompters to capture semantic and frequency-domain cues for precise pixel-level localization. On DEAL-300K and external benchmarks, MFPT achieves state-of-the-art localization performance and demonstrates robustness to JPEG compression and blur, with strong cross-domain generalization. The authors also emphasize a scalable, automated annotation workflow and propose future directions toward video manipulation localization.

Abstract

Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present the Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
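For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a pixel-level F1 score can be computed for a single predicted mask against a ground-truth mask. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `pixel_f1`, the binary-mask input format, and the convention of returning 1.0 when both masks are empty are all hypothetical choices. How the paper thresholds predictions and aggregates scores across images is not stated in this summary.

```python
import numpy as np


def pixel_f1(pred_mask, gt_mask):
    """Pixel-level F1 between two binary masks of identical shape.

    Hypothetical helper for illustration; nonzero pixels count as "edited".
    """
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)
    if pred.shape != gt.shape:
        raise ValueError("pred_mask and gt_mask must have the same shape")

    tp = np.logical_and(pred, gt).sum()    # edited pixels correctly flagged
    fp = np.logical_and(pred, ~gt).sum()   # pristine pixels wrongly flagged
    fn = np.logical_and(~pred, gt).sum()   # edited pixels that were missed

    denom = 2 * tp + fp + fn
    if denom == 0:
        # Assumption: an empty prediction on an unedited image is a perfect match.
        return 1.0
    return float(2 * tp / denom)


# Toy example: a 4x4 image whose top-left 2x2 block was edited.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[:2, :2] = 1

# The prediction covers the edit plus one extra column (2 false positives).
pred = np.zeros((4, 4), dtype=np.uint8)
pred[:2, :3] = 1

score = pixel_f1(pred, gt)  # tp=4, fp=2, fn=0, so F1 = 8 / 10 = 0.8
```

Because F1 ignores true negatives, it is not inflated by the large unedited background that dominates most locally edited images, which is one reason it is a common choice for localization tasks.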

Paper Structure

This paper contains 32 sections, 3 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: Sample images and annotations from the proposed DEAL-300K dataset. From left to right: the original image, its edited version, and the annotations. Below each set is the instruction used for editing. Red indicates the edited areas.
  • Figure 2: MLLM-driven workflow for generating edited images. SFT refers to Supervised Fine-Tuning.
  • Figure 3: Comparison of automated annotations with ground truth from CoCoGlide (Guillaro et al., CVPR 2023), including subtraction analysis.
  • Figure 4: Examples from the DEAL-300K generation pipeline. Qwen-VL generates editing instructions, InstructPix2Pix produces edited images, and SAM-CD generates pixel-level masks.
  • Figure 5: Word cloud of editing instructions.
  • ...and 7 more figures