IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, Caifeng Shan

TL;DR

This work tackles the scarcity of explainable, multimodal benchmarks for AIGC detection by introducing Ivy-Fake, a large-scale dataset with rich, multi-dimensional explanations for images and videos, and Ivy-xDetector, an explainable detector trained with reinforcement learning via GRPO. Ivy-Fake combines diverse public and synthetic sources with stringent quality control and a two-tier annotation scheme to enable transparent evaluation of detection and reasoning. The two-stage training (instruction-driven initialization followed by GRPO-based fine-tuning) yields state-of-the-art accuracy on GenImage and GenVideo while requiring far fewer training samples than prior methods. The results underscore the value of integrated multimodal benchmarks and explainable AI for robust detection and trustworthy provenance in synthetic media.

Abstract

The rapid development of Artificial Intelligence Generated Content (AIGC) techniques has enabled the creation of high-quality synthetic content, but it also raises significant security concerns. Current detection methods face two major limitations. (1) Multidimensional explainable datasets for generated images and videos are lacking: existing open-source datasets (e.g., WildFake, GenVideo) rely on oversimplified binary annotations, which restrict the explainability and trustworthiness of trained detectors. (2) Prior MLLM-based forgery detectors (e.g., FakeVLM) exhibit insufficiently fine-grained interpretability in their step-by-step reasoning, which hinders reliable localization and explanation. To address these challenges, we introduce Ivy-Fake, the first large-scale multimodal benchmark for explainable AIGC detection. It consists of over 106K richly annotated training samples (images and videos) and 5,000 manually verified evaluation examples, sourced from multiple generative models and real-world datasets through a carefully designed pipeline to ensure both diversity and quality. Furthermore, we propose Ivy-xDetector, a reinforcement learning model based on Group Relative Policy Optimization (GRPO), capable of producing explainable reasoning chains and achieving robust performance across multiple synthetic content detection benchmarks. Extensive experiments demonstrate the superiority of our dataset and confirm the effectiveness of our approach. Notably, our method improves performance on GenImage from 86.88% to 96.32% (a gain of 9.44 percentage points), surpassing prior state-of-the-art methods by a clear margin.
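
The abstract names GRPO without spelling out its core step, so a minimal sketch may help. In GRPO as commonly described, several responses are sampled for the same input and each response's reward is normalized against the mean and standard deviation of its own group, which replaces a learned value baseline. The function name, the reward values, the `eps` constant, and the use of the population standard deviation below are illustrative assumptions; this is not taken from the Ivy-xDetector implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for responses sampled from the same input.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses to one image or video:
# 1.0 when the real/fake verdict is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive advantage and the others a negative one; a group in which every response earns the same reward yields zero advantages and therefore contributes no learning signal.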

Paper Structure

This paper contains 40 sections, 3 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Overview of the Ivy-Fake framework: By conducting in-depth analysis of temporal and spatial artifacts, the framework enables explainable detection of AI-generated content.
  • Figure 2: Token Length Distributions and Multi-Dimensional Coverage Across Datasets. Left: Distribution of token lengths across datasets; Right: Coverage of multiple dimensions in explainability datasets, extracted using Qwen3-32B. The prompt is given in the appendix.
  • Figure 3: Comparison between Ivy-Fake and FakeVLM (NeurIPS 2025). The Ivy-Fake dataset provides richer and more fine-grained interpretability dimensions.
  • Figure 4: Overview of the three-stage training pipeline for Ivy-xDetector, including general video understanding, detection instruction tuning, and interpretability instruction tuning.
  • Figure 5: Ivy-xDetector performance on video detection. The model provides fine-grained spatiotemporal reasoning chains for explainable analysis.
  • ...and 11 more figures