Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts

Philipp Moeßner, Heike Adel

TL;DR

This work studies how well real and AI-generated images can be told apart, and how the level of detail in the generation prompt affects detectability. It introduces COCOXGEN, a benchmark that combines real COCO photos with images generated by SDXL and Fooocus from short and long prompts, and pairs a large-scale user study (n=200) with a state-of-the-art machine detector (Grag2021) to compare human and machine performance directly. Longer, more detailed prompts make generated images easier to detect for both humans and machines; humans rely more on general image impressions, whereas the detector relies on fine details. The generation model also affects detectability: Fooocus images are harder to detect than SDXL images. The study further examines detector limitations under downsampling, demonstrates partial alignment between human and machine attention, and discusses practical implications for real-world detection and prompt design; extending the analysis to additional models and prompt characteristics is left to future work. The results are framed around hypotheses $H_1$ and $H_2$, which formalize the expected impact of prompt detail and generation model quality on detectability, and they underscore the potential of combining human and machine cues to counter AI-generated disinformation.

Abstract

With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public by making disinformation easier to spread. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on fake detection performance has rarely been investigated. This work therefore examines the influence of the prompt's level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.

Paper Structure

This paper contains 34 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: We conduct a study in which humans and an AI model detect real and fake images generated from prompts with different levels of detail (short prompt (SP), long prompt (LP)), and we visualize the image areas that led to their decisions.
  • Figure 2: Our contributions are (1) the creation of COCOXGEN, a novel benchmark dataset with images created by two different generation models based on prompts of two different levels of detail, (2) a large-scale user study on fake image detection, and (3) a direct comparison of human and machine detection performance and decision rationales.
  • Figure 3: F1 scores per prompt type (SP: short prompt with fewer details, LP: long prompt with more details).
  • Figure 4: The image area selection results show that participants pay particular attention to clear objects.
  • Figure 5: F1 scores per image generator (left and middle; the lower the score, the more realistic the generated images) and for real COCO images (right; the higher the better).
  • ...and 7 more figures
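
Several of the captions above report F1 scores broken down by prompt type or generator. As a rough illustration of what such a per-group F1 involves, here is a minimal sketch in Python. The record layout, the function names, and the toy judgments are all assumptions made for illustration; they are not taken from the paper or from COCOXGEN, and the sketch assumes each group contains both real and fake images, which may differ from how the authors pooled their data.

```python
from collections import defaultdict


def f1_score(labels, preds, positive="fake"):
    """F1 for the `positive` class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    denom = 2 * tp + fp + fn
    # No positive labels and no positive predictions: report 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


def f1_by_group(records, positive="fake"):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, pred in records:
        grouped[group][0].append(label)
        grouped[group][1].append(pred)
    return {g: f1_score(ys, ps, positive) for g, (ys, ps) in grouped.items()}


# Hypothetical judgments (made up): group is the prompt type, SP or LP.
toy_records = [
    ("SP", "fake", "fake"), ("SP", "fake", "real"),
    ("SP", "real", "real"), ("SP", "real", "fake"),
    ("LP", "fake", "fake"), ("LP", "fake", "fake"),
    ("LP", "real", "real"), ("LP", "real", "fake"),
]

if __name__ == "__main__":
    for group, score in f1_by_group(toy_records).items():
        print(f"{group}: F1 = {score:.2f}")
```

The same helper works for a per-generator breakdown by using the generator name as the group key instead of the prompt type.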