Zoom-In to Sort AI-Generated Images Out

Yikun Ji, Yan Hong, Bowen Deng, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

TL;DR

This paper introduces ZoomIn, a two-stage framework built on a Vision-Language Model (VLM) for forensic detection of AI-generated images: it first scans the whole image to locate suspicious regions, then analyzes zoomed-in crops of those regions and returns a verdict with grounded explanations. To train such a system, the paper presents MagniFake, a 20k-image dataset with bounding boxes and forensic explanations generated via an automated VLM-based annotation pipeline. ZoomIn reports state-of-the-art accuracy on MagniFake and robust out-of-distribution (OoD) generalization while providing interpretable, region-grounded reasoning. Training uses reinforcement learning (GRPO) to optimize both localization and explanation quality, highlighting the practical value of interpretable AI for digital forensics.

Abstract

The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising critical concerns for digital integrity. Vision-language models (VLMs) offer interpretability through explanations but often fail to detect subtle artifacts in high-quality synthetic images. We propose ZoomIn, a two-stage forensic framework that improves both accuracy and interpretability. Mimicking human visual inspection, ZoomIn first scans an image to locate suspicious regions and then performs a focused analysis on these zoomed-in areas to deliver a grounded verdict. To support training, we introduce MagniFake, a dataset of 20,000 real and high-quality synthetic images annotated with bounding boxes and forensic explanations, generated through an automated VLM-based pipeline. Our method achieves 96.39% accuracy with robust generalization, while providing human-understandable explanations grounded in visual evidence.
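
The two-stage flow described above (a global scan that proposes regions, followed by a focused check on zoomed-in crops) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `expand_box`, `crop`, and `zoom_in_detect`, the 10% padding margin, and the representation of an image as a nested list of pixels are all assumptions made for the example, and the two VLM queries are passed in as plain callables.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical box convention for this sketch: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]
Image = Sequence[Sequence[int]]  # rows of pixel values; a stand-in for a real image type


def expand_box(box: Box, width: int, height: int, margin: float = 0.1) -> Box:
    """Pad a box by a relative margin and clamp it to the image bounds.

    The 10% default margin is an assumption for illustration only.
    """
    left, top, right, bottom = box
    pad_x = int((right - left) * margin)
    pad_y = int((bottom - top) * margin)
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )


def crop(image: Image, box: Box) -> List[List[int]]:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def zoom_in_detect(
    image: Image,
    global_scan: Callable[[Image], List[Box]],
    local_check: Callable[[Image, List[List[List[int]]]], str],
) -> str:
    """Stage 1 proposes suspicious regions; stage 2 judges the zoomed-in crops.

    `global_scan` and `local_check` are placeholders for the two VLM queries.
    """
    height, width = len(image), len(image[0])
    boxes = global_scan(image)
    crops = [crop(image, expand_box(b, width, height)) for b in boxes]
    return local_check(image, crops)


# Toy usage with stub callables in place of the model.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
verdict = zoom_in_detect(
    toy_image,
    global_scan=lambda img: [(1, 1, 3, 3)],
    local_check=lambda img, crops: "fake" if crops else "real",
)
```

Passing the two queries as callables keeps the cropping logic testable without a model; in a real system both callables would presumably wrap calls to the same VLM.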

Paper Structure

This paper contains 44 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: (a) Without revisiting specific details, VLMs may overlook critical cues and produce false reasoning with incorrect decisions. (b) Our two-stage ZoomIn pipeline. The VLM first performs a global scan to query region(s) of interest (Query 1), then analyzes the cropped regions for a detailed, final verdict with grounded explanations (Query 2, "Local Evidence Check").
  • Figure 2: The proposed data annotation pipeline. We ask the forensics expert VLM in Query 1 "Explanation Generation" to identify key reasons that make this image look real or AI-generated, followed by Query 2 "Spatial Grounding", which uses the explanation to extract bounding boxes.
  • Figure 3: Examples from the test set of MagniFake. Captions are summarized from the Query 2 responses generated by ZoomIn-32B.
  • Figure 4: The number of bounding boxes in Query 1 for ZoomIn-32B/7B models.
  • Figure 5: Accuracy as a function of the number of detected bounding boxes.
  • ...and 6 more figures