Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images

Chuangchuang Tan, Xiang Ming, Jinglu Wang, Renshuai Tao, Bin Li, Yunchao Wei, Yao Zhao, Yan Lu

TL;DR

This work tackles semantic-level inconsistencies in AI-generated imagery by formalizing semantic anomaly detection and reasoning and introducing the AnomReason benchmark, built with a modular multi-agent framework (AnomAgent) and lightweight human-in-the-loop (HITL) verification. The pipeline yields structured anomaly annotations (Name, Phenomenon, Reasoning, Severity), which are evaluated with the proposed SemAP and SemF1 metrics. AnomReasonor-7B, a model fine-tuned on this supervision, achieves state-of-the-art SemAP scores and competitive explainable deepfake detection performance, surpassing many open baselines and approaching proprietary systems on reasoning tasks. The framework supports applications in explainable deepfake detection and semantic reasonableness assessment, offering a scalable, interpretable way to audit AIGC for plausibility and trustworthiness. The authors plan to extend it to video and further improve annotation quality.

Abstract

The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle semantic anomalies, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper, we formalize semantic anomaly detection and reasoning for AIGC images and introduce AnomReason, a large-scale benchmark with structured annotations as quadruples (Name, Phenomenon, Reasoning, Severity). Annotations are produced by a modular multi-agent pipeline (AnomAgent) with lightweight human-in-the-loop verification, enabling scale while preserving quality. At construction time, AnomAgent processed approximately 4.17B GPT-4o tokens, which indicates the scale of the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (SemAP and SemF1). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.
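
The quadruple annotation format described in the abstract can be pictured as a small record type. The following is only an illustrative sketch, not the authors' released schema: the class name `Anomaly`, the field names, the example text, and the numeric severity value are all assumptions, and the text summarized here does not state the severity scale.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Anomaly:
    """One structured annotation: (Name, Phenomenon, Reasoning, Severity)."""
    name: str          # short label for the anomaly
    phenomenon: str    # what is observed in the image
    reasoning: str     # why the observation is implausible
    severity: float    # degree of implausibility; the scale is an assumption here


# Hypothetical example, loosely modeled on the pipe-balancing case in Figure 3.
example = Anomaly(
    name="Unstable pipe placement",
    phenomenon="Two cylindrical pipes are balanced on a person's shoulder.",
    reasoning="Unsecured cylinders would roll off, so the pose is physically implausible.",
    severity=3.0,  # placeholder value, not taken from the paper
)

# Convert to a plain dict, e.g. for JSON serialization.
record = asdict(example)
print(record["name"], record["severity"])
```

Keeping the record immutable (`frozen=True`) is a design choice for this sketch so that annotations can be deduplicated or used as dictionary keys; it is not implied by the paper.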

Paper Structure

This paper contains 38 sections, 17 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Semantic anomaly detection in AI-generated images. (a) Illustration of high-level semantic anomalies that are context-dependent and subtle, such as inconsistent physics, anatomy, and reflections, which go beyond surface-level visual artifacts. (b) Comparison of detection performance between general-purpose vision-language models (e.g., GPT-5, Qwen2.5vl-72B) and the proposed AnomAgent. While the former focus on surface-level cues such as lighting and textures, AnomAgent identifies fine-grained semantic inconsistencies and provides structured, explainable outputs with severity ratings.
  • Figure 2: Overview of the AnomAgent pipeline for semantic anomaly annotation. Stage 1 parses visual entities and yields an object list $\mathcal{O}$. Stage 2 performs multi-perspective anomaly mining, producing attribute candidates $\mathcal{C}_{\text{attr}}$ and relational candidates $\mathcal{C}_{\text{rel}}$, which are scored and pruned to $\mathcal{C}^{+}$. Stage 3 consolidates candidates (merging near-duplicates to $\hat{\mathcal{C}}$) and outputs structured anomalies $\mathcal{A}=\{(y,o,r,v)\}$ (Name, Phenomenon, Reasoning, Severity).
  • Figure 3: Example of structured anomalies. This figure illustrates a detected anomaly where two cylindrical pipes are unrealistically balanced on the individual’s shoulder. By structuring the anomaly as {Name, Observed Phenomenon, Reasoning, Severity Score}, the model not only provides a clear description of the anomaly but also offers an interpretable reasoning process, making it easier to understand why this arrangement is physically implausible. The severity score quantifies the degree of implausibility, enhancing the model's ability to observe and explain semantic-level anomalies. This structure allows for transparent and interpretable anomaly detection, improving the detection model's trustworthiness and explainability.
  • Figure 4: AnomReason statistics. (a) Total image count per generator: Flux contributes 6983 images, Sdv3.5 contributes 4645, and Midjourney contributes the most with 9911. (b) Annotations per image before and after human evaluation: Flux drops from 8.20 to 5.88, Sdv3.5 from 8.11 to 5.86, and Midjourney from 8.07 to 5.96. (c) Severity distribution before and after human evaluation, showing a shift towards lower severity values after human evaluation, which reflects the refinement of annotation quality.
  • Figure 5: Structured visual anomalies in a tennis scene. AnomReasonor-7B identifies both surface-level inconsistencies (e.g., lighting and color mismatch) and deeper semantic-level anomalies, such as biomechanically implausible wrist articulation and unnatural hand–racket interaction. Each anomaly is described with a structured triplet: Name, Phenomenon, Reasoning.
  • ...and 2 more figures
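
The Figure 4 numbers lend themselves to a quick arithmetic cross-check. The sketch below uses only the counts and per-image averages quoted in that caption; the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the paper.

```python
# Per-generator statistics as quoted in the Figure 4 caption.
stats = {
    "Flux": {"images": 6983, "before": 8.20, "after": 5.88},
    "Sdv3.5": {"images": 4645, "before": 8.11, "after": 5.86},
    "Midjourney": {"images": 9911, "before": 8.07, "after": 5.96},
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of annotations per image removed by human evaluation."""
    return (before - after) / before


# Total number of images across the three generators.
total_images = sum(s["images"] for s in stats.values())

# Percentage drop in annotations per image, rounded to one decimal place.
reductions = {
    name: round(100 * relative_reduction(s["before"], s["after"]), 1)
    for name, s in stats.items()
}

print(total_images)  # 21539
print(reductions)    # {'Flux': 28.3, 'Sdv3.5': 27.7, 'Midjourney': 26.1}
```

Under these figures, human verification removes roughly a quarter to a third of the candidate annotations per image for every generator, with similar proportions across all three.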