Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach

Lvpan Cai, Haowei Wang, Jiayi Ji, Yanshu Zhoumen, Shen Chen, Taiping Yao, Xiaoshuai Sun

TL;DR

The paper tackles the difficulty of detecting localized AI-generated forgeries by introducing BR-Gen, a 150k-sample dataset that covers scene-level edits in stuff and background regions, built via an automated Perception-Creation-Evaluation pipeline. It further presents NFA-ViT, a noise-guided forgery amplification transformer that diffuses subtle forgery cues across the image through a dual-branch attention mechanism and a learnable decoder. Extensive experiments on BR-Gen show that current methods struggle with these broader edits, while NFA-ViT achieves strong detection and localization performance and generalizes across benchmarks. Together, BR-Gen and NFA-ViT offer a new, challenging platform and a robust method for advancing localized AIGC forgery detection in diverse, real-world scenes.

Abstract

The rise of AI-generated image tools has made localized forgeries increasingly realistic, posing challenges for visual content integrity. Although recent efforts have explored localized AIGC detection, existing datasets predominantly focus on object-level forgeries while overlooking broader scene edits in regions such as sky or ground. To address these limitations, we introduce BR-Gen, a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations, which are based on semantic calibration to ensure high-quality samples. BR-Gen is constructed through a fully automated "Perception-Creation-Evaluation" pipeline to ensure semantic coherence and visual realism. We further propose NFA-ViT, a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries by amplifying subtle forgery-related features across the entire image. NFA-ViT uses noise fingerprints to mine heterogeneous regions in images, i.e., potentially edited areas. An attention mechanism then compels normal and abnormal features to interact, propagating the forgery traces throughout the entire image so that subtle forgeries influence a broader context, which improves overall detection robustness. Extensive experiments demonstrate that BR-Gen presents entirely new scenarios that existing methods do not cover. Furthermore, NFA-ViT outperforms existing methods on BR-Gen and generalizes well across current benchmarks.
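The noise-guided idea in the abstract can be illustrated with a toy sketch. This is not the NFA-ViT implementation: the Laplacian high-pass filter, the median/MAD outlier score, the additive attention bias, and all function names (`highpass_residual`, `patch_noise_energy`, `heterogeneity_scores`, `noise_guided_attention`) are assumptions chosen only to show how a noise cue could flag heterogeneous patches and steer attention toward them.

```python
import math
from statistics import median


def highpass_residual(img):
    """4-neighbour Laplacian residual with replicate padding (assumed noise proxy)."""
    h, w = len(img), len(img[0])
    res = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            up = img[max(y - 1, 0)][x]
            down = img[min(y + 1, h - 1)][x]
            left = img[y][max(x - 1, 0)]
            right = img[y][min(x + 1, w - 1)]
            res[y][x] = 4 * img[y][x] - up - down - left - right
    return res


def patch_noise_energy(res, patch):
    """Mean absolute residual per non-overlapping patch, in row-major patch order."""
    h, w = len(res), len(res[0])
    energies = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            vals = [abs(res[y][x])
                    for y in range(py, py + patch)
                    for x in range(px, px + patch)]
            energies.append(sum(vals) / len(vals))
    return energies


def heterogeneity_scores(energies):
    """Robust outlier score: distance from the median in units of the MAD."""
    med = median(energies)
    mad = median(abs(e - med) for e in energies) or 1e-8
    return [abs(e - med) / mad for e in energies]


def noise_guided_attention(features, scores, lam=1.0):
    """Row-wise softmax attention whose logits get an additive bias toward
    keys (patches) with a high heterogeneity score."""
    d = len(features[0])
    weights = []
    for q in features:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + lam * s
                  for k, s in zip(features, scores)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


# Toy image: flat background with a checkerboard "edit" in the bottom-right patch.
img = [[0.5] * 8 for _ in range(8)]
for y in range(4, 8):
    for x in range(4, 8):
        img[y][x] += 0.1 if (x + y) % 2 == 0 else -0.1

energies = patch_noise_energy(highpass_residual(img), patch=4)
scores = heterogeneity_scores(energies)
attn = noise_guided_attention([[e] for e in energies], scores)
```

In this toy setup every patch attends mostly to the patch whose noise statistics deviate from the rest, which is the sense in which a small edited region can influence features across the whole image. A learned model would replace each hand-written step with trainable components.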

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of four forgery scenarios. Existing datasets mainly cover fully generated images and object-level forgeries, while forgeries in stuff and background regions remain largely unaddressed. Red regions show ground-truth forgeries. State-of-the-art models (FatFormer [liu2024forgery] and SparseViT [su2024can]) struggle with these new cases, whereas the proposed NFA-ViT achieves robust detection across all four scenarios. The source data comes from the open-source datasets GRE [sun2024rethinking], COCO [lin2014microsoft], and ImageNet [deng2009imagenet].
  • Figure 2: The automated pipeline for the BR-Gen dataset consists of three iterative stages: Perception, Creation, and Evaluation. These stages are applied to produce high-quality localized generation datasets through progressive refinement. All samples are sourced from publicly available datasets [zhou2017places, lin2014microsoft, deng2009imagenet].
  • Figure 3: Partial examples from the BR-Gen dataset. The real data comes from the open datasets Places [zhou2017places], COCO [lin2014microsoft], and ImageNet [deng2009imagenet].
  • Figure 4: The proposed NFA-ViT framework, which contains dual branches of noise and image, uses noise information to guide the focus area of the image. For the image encoder, a sparse attention mechanism is introduced.
  • Figure 5: Localization results of different models. We compared images generated by two types of masks. All samples are sourced from publicly available datasets [zhou2017places, lin2014microsoft, deng2009imagenet].