The Visual Counter Turing Test (VCT2): A Benchmark for Evaluating AI-Generated Image Detection and the Visual AI Index (VAI)

Nasrin Imanpour, Abhilekh Borah, Shashwat Bajpai, Subhankar Ghosh, Sainath Reddy Sankepally, Hasnat Md Abdullah, Nishoak Kosaraju, Shreyas Dixit, Ashhar Aziz, Shwetangshu Biswas, Vinija Jain, Aman Chadha, Song Wang, Amit Sheth, Amitava Das

TL;DR

The paper addresses misinformation risks from AI-generated imagery by revealing weak generalization of existing AGID methods to unseen models. It introduces VCT2, a large zero-shot benchmark with ~166k images from six T2I models across COCOAI and TwitterAI prompts, and the Visual AI Index (VAI), a 12-feature, logistic-regression realism score. Evaluating 17 detectors reveals average detection around 58%, with performance degrading on newer/proprietary models, underscoring distribution shifts. VAI correlates inversely with detectability, suggesting more realistic images are harder to flag, though statistical significance is limited by the number of models tested. The work provides public data and code, offering a robust testbed and a practical realism metric to guide development of generalized AGID methods and perceptual realism assessments.

Abstract

The rapid progress and widespread availability of text-to-image (T2I) generative models have heightened concerns about the misuse of AI-generated visuals, particularly in the context of misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and falter on outputs from newer or unseen models. We introduce the Visual Counter Turing Test (VCT2), a comprehensive benchmark of 166,000 images, comprising both real and synthetic prompt-image pairs produced by six state-of-the-art T2I systems: Stable Diffusion 2.1, SDXL, SD3 Medium, SD3.5 Large, DALL·E 3, and Midjourney 6. We curate two distinct subsets: COCOAI, featuring structured captions from MS COCO, and TwitterAI, containing narrative-style tweets from The New York Times. Under a unified zero-shot evaluation, we benchmark 17 leading AGID models and observe alarmingly low detection accuracy: 58% on COCOAI and 58.34% on TwitterAI. To transcend binary classification, we propose the Visual AI Index (VAI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features, enabling us to quantify and rank the perceptual quality of generated outputs with greater nuance. Correlation analysis reveals a moderate inverse relationship between VAI and detection accuracy, with a Pearson correlation of -0.532 on COCOAI and -0.503 on TwitterAI, suggesting that more visually realistic images tend to be harder to detect, a trend observed consistently across generators. We release COCOAI, TwitterAI, and all code to catalyze future advances in generalized AGID and perceptual realism assessment.
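
As a rough illustration of the correlation analysis mentioned in the abstract, the sketch below computes a Pearson coefficient between per-generator realism scores and per-generator detection accuracies. It is a minimal sketch, not the authors' code: the function name `pearson_r`, the variable names, and all numeric values are placeholders invented for illustration and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, one entry per generator. These are NOT numbers from
# the paper; they only mimic the "higher realism, lower accuracy" pattern.
vai_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
detection_accuracy = [0.70, 0.66, 0.61, 0.63, 0.55, 0.52]

r = pearson_r(vai_scores, detection_accuracy)
print(f"Pearson r = {r:.3f}")  # negative value indicates an inverse relationship
```

With only six generators, each correlation is computed over six points, which is why the TL;DR notes that statistical significance is limited by the number of models tested.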

Paper Structure

This paper contains 23 sections, 1 equation, 15 figures, 6 tables.

Figures (15)

  • Figure 1: An AI-generated image of Pope Francis wearing a gigantic white puffer jacket went viral on social media platforms like Reddit and Twitter (X) in March 2023. This image sparked widespread media discussions on the potential misuse of generative AI technologies, becoming an iconic example of AI-generated misinformation. For more details, see https://www.forbes.com/sites/mattnovak/2023/03/26/that-viral-image-of-pope-francis-wearing-a-white-puffer-coat-is-totally-fake/.
  • Figure 2: The taxonomy of AI-generated image detection techniques, categorized into three main groups: Generation Artifact-Based Detection, Feature Representation-Based Detection, and Hybrid Techniques.
  • Figure 3: Average detection accuracy across the COCOAI and TwitterAI subsets for each generator.
  • Figure 4: Right: $V_{AI}$ scores of COCOAI dataset. Left: Accuracy heat maps showing the average accuracy of each AGID method.
  • Figure 5: Right: $V_{AI}$ scores of TwitterAI dataset. Left: Accuracy heat maps showing the average accuracy of each AGID method.
  • ...and 10 more figures