Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images
Chuangchuang Tan, Xiang Ming, Jinglu Wang, Renshuai Tao, Bin Li, Yunchao Wei, Yao Zhao, Yan Lu
TL;DR
This work tackles semantic-level inconsistencies in AI-generated imagery by formalizing semantic anomaly detection and reasoning and introducing the AnomReason benchmark, built with a modular multi-agent framework (AnomAgent) and lightweight human-in-the-loop (HITL) verification. The pipeline yields structured anomaly annotations (Name, Phenomenon, Reasoning, Severity) and supports evaluation with the SemAP and SemF1 metrics. The AnomReasonor-7B model, fine-tuned on this supervision, achieves state-of-the-art SemAP scores and competitive explainable deepfake detection performance, surpassing many open baselines and approaching proprietary systems on reasoning tasks. The framework supports applications in explainable deepfake detection and semantic reasonableness assessment, offering a scalable, interpretable way to audit AIGC for plausibility and trustworthiness; the authors plan to extend it to video and to further improve annotation quality.
Abstract
The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle \textbf{semantic anomalies}, including unrealistic object configurations, violations of physical laws, and commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper, we formalize \textbf{semantic anomaly detection and reasoning} for AIGC images and introduce \textbf{AnomReason}, a large-scale benchmark whose structured annotations take the form of quadruples \emph{(Name, Phenomenon, Reasoning, Severity)}. Annotations are produced by a modular multi-agent pipeline (\textbf{AnomAgent}) with lightweight human-in-the-loop verification, enabling scale while preserving quality. During construction, AnomAgent processed approximately 4.17\,B GPT-4o tokens, which indicates the scale of the resulting structured annotations. We further show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (\textit{SemAP} and \textit{SemF1}). Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.
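To make the quadruple annotation format concrete, the following is a minimal sketch of how one (Name, Phenomenon, Reasoning, Severity) record might be represented and serialized. The class name `AnomalyAnnotation`, the helper `to_json`, the string-valued severity, and the example record are all illustrative assumptions; the abstract does not specify the released data schema or the severity scale.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnomalyAnnotation:
    """One semantic anomaly, mirroring the (Name, Phenomenon, Reasoning, Severity) quadruple.

    Hypothetical container: field names follow the quadruple in the abstract,
    but the types are assumptions.
    """

    name: str        # short label for the anomaly
    phenomenon: str  # what is observed in the image
    reasoning: str   # why the observation is implausible
    severity: str    # assumed to be a free-form label; the real scale is not stated


def to_json(annotations):
    """Serialize a list of annotations to a JSON string, one object per anomaly."""
    return json.dumps([asdict(a) for a in annotations], indent=2)


# Invented example record, for illustration only (not taken from the benchmark).
example = AnomalyAnnotation(
    name="extra finger",
    phenomenon="The left hand shows six fingers.",
    reasoning="Human hands normally have five fingers.",
    severity="high",
)

print(to_json([example]))
```

Keeping each anomaly as a flat record like this would let a model's predicted quadruples be compared field by field against reference annotations, which is the kind of matching that metrics such as SemAP and SemF1 presumably operate on.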
