INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts

Anshul Bagaria

TL;DR

INSIGHT presents a unified multimodal forensic pipeline designed to robustly detect AI-generated images under extreme degradation while providing interpretable, semantically grounded explanations. It combines hierarchical super-resolution (DRCT), superpixel-based attention localization, GradCAM-guided artifact spotting, CLIP-based semantic alignment, and a ReAct + Chain-of-Thought reasoning framework, all validated by a multimodal LLM judge and G-Eval rubric-based scoring. Across diverse domains and challenging datasets, INSIGHT achieves competitive detection performance, improved explanation quality, and reliable verification, demonstrating the value of integrating perceptual cues with structured reasoning and evidence-grounded narratives. The work also includes thorough ablations, adversarial robustness tests, and audience-tailored reporting, underscoring the importance of fidelity, interpretability, and reliability for trustworthy multimodal forensics in real-world settings.

Abstract

The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.

Paper Structure

This paper contains 66 sections, 33 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: The Architecture of the INSIGHT Binary Classifier Backbone. The design features a hybrid CNN–ResNet backbone operating on frequency-domain transformed inputs, integrating contrastive pre-training and a final stage for adversarial robustness. The resulting logits and activation maps serve as trustworthy inputs for subsequent downstream tasks.
  • Figure 2: Stage 1: Workflow for Low-Resolution Structure Recovery and Attention-Guided Artifact Localization. The pipeline begins with Hierarchical Forensic Super-Resolution via DRCT, transforming a low-resolution input into a higher-resolution representation. Simultaneously, GradCAM is employed for Attention-Guided Artifact Localization, generating a saliency map that highlights critical regions. This attention information, combined with Superpixel-Aware Region Proposals, guides the subsequent Patch Extraction and Attention-Weighted Superpixel Grouping module to robustly recover and emphasize important structural details from the upscaled image.
  • Figure 3: Attention-Weighted Superpixel-to-Patch Decomposition for Hierarchical Forensic Reasoning. The module addresses the limitations of standard patching by adopting a geometrically consistent and salience-aware decomposition. Superpixel regions are subdivided into fixed-size patches $P_{k,i}$. Forensic salience is integrated using GradCAM activations $A(x, y)$ to derive a raw patch weight $\alpha_{k,i}$. A contrast-sharpening function generates the attention-derived weight $w_{k,i}$, focusing the analysis on high-salience regions. Crucially, hierarchical consistency is enforced by modulating $w_{k,i}$ with the parent superpixel activation $\sigma(A(S_k))$ to obtain the final weight $\tilde{w}_{k,i}$. The resulting weighted patches are then mapped into a semantic feature space $\mathbf{v}_{k,i} = f_{\text{enc}}(P_{k,i})$ for downstream multimodal reasoning and artifact-level explanation.
  • Figure 4: Semantic Scoring and Multimodal Feature Alignment using Dual-Granularity CLIP Embeddings. This module quantifies the forensic suspicion associated with different manipulation categories by leveraging CLIP's zero-shot capability. For each superpixel patch $p_i$, a visual embedding $\mathbf{z}_i = f_{\text{img}}(p_i)$ is computed and compared via cosine similarity $\text{Sim}(\cdot, \cdot)$ against textual embeddings $\mathbf{u}_c = f_{\text{text}}(t_c)$ of curated artifact descriptor prompts. The final unified semantic score $S_c$ for each category $c$ is calculated by aggregating similarities across both coarse and fine superpixel partitions using a weighted average, ensuring both structural stability and fine-grained sensitivity.
  • Figure 5: Grounded Forensic Explanation via the ReAct and Chain-of-Thought Frameworks. The final reasoning stage utilizes the ReAct (Reason + Act) policy $\Phi$ (Eq. 19) to ensure explanations are tightly coupled to visual evidence. This step-by-step process mitigates hallucination. Upon completion, the Chain-of-Thought (CoT) decoder $\Psi$ (Eq. 20) synthesizes the accumulated deductions into a structured, verifiable forensic explanation ($\mathbf{E}_{\text{CoT}}$), providing a complete trace of the decision logic for transparency.
  • ...and 12 more figures
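
The weighting and scoring steps described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the captions do not give the exact formulas, so several choices here are assumptions made only for illustration. The raw weight $\alpha_{k,i}$ is taken as the mean GradCAM activation over the patch, the contrast-sharpening function is a normalised power law with a hypothetical exponent `gamma`, $\sigma$ is taken to be the logistic sigmoid, similarities are averaged over the patches of each partition, and the coarse/fine mixing weight `lam` is hypothetical. Embeddings are plain lists of floats standing in for CLIP outputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, assumed here to be the sigma in sigma(A(S_k))."""
    return 1.0 / (1.0 + math.exp(-x))


def patch_weights(patch_acts, superpixel_act, gamma=2.0):
    """Figure 3 sketch: final weights w~_{k,i} for the patches of one superpixel.

    patch_acts: one 2-D list of GradCAM activations A(x, y) per patch P_{k,i}.
    superpixel_act: scalar activation A(S_k) of the parent superpixel.
    gamma: hypothetical sharpening exponent (not taken from the paper).
    """
    # Raw weight alpha_{k,i}: mean activation over the patch (assumption),
    # clipped at zero so the power law below is well defined.
    alpha = [
        max(sum(map(sum, p)) / sum(len(row) for row in p), 0.0)
        for p in patch_acts
    ]
    # Contrast sharpening (assumption): power law, normalised to sum to one.
    sharpened = [a ** gamma for a in alpha]
    total = sum(sharpened) or 1.0  # avoid division by zero for all-zero maps
    w = [s / total for s in sharpened]
    # Hierarchical consistency: modulate by the parent superpixel activation.
    gate = sigmoid(superpixel_act)
    return [wi * gate for wi in w]


def cosine(a, b):
    """Cosine similarity Sim(a, b) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def partition_scores(patch_embs, text_embs):
    """Mean similarity of all patch embeddings z_i to each text embedding u_c."""
    return [
        sum(cosine(z, u) for z in patch_embs) / len(patch_embs)
        for u in text_embs
    ]


def semantic_scores(coarse_embs, fine_embs, text_embs, lam=0.5):
    """Figure 4 sketch: unified score S_c as a weighted average of the
    coarse-partition and fine-partition scores (lam is hypothetical)."""
    coarse = partition_scores(coarse_embs, text_embs)
    fine = partition_scores(fine_embs, text_embs)
    return [lam * c + (1.0 - lam) * f for c, f in zip(coarse, fine)]
```

In this sketch a larger `gamma` concentrates weight on the most salient patches, while the sigmoid gate scales down every patch inside a superpixel that GradCAM considers unimportant. Setting `lam` closer to 1 favours the structurally stable coarse partition, and closer to 0 favours fine-grained sensitivity.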