Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts

Philipp Moeßner; Heike Adel

TL;DR

This work studies how well real images can be distinguished from AI-generated ones, and how the prompt's level of detail affects detectability. It introduces COCOXGEN, a benchmark that combines real COCO photos with images generated by SDXL and Fooocus from short and long prompts. A large-scale user study (n=200) and a state-of-the-art machine detector (Grag2021) are evaluated on this benchmark so that human and machine performance can be compared directly. The main finding is that longer, more detailed prompts make generated images easier to detect for both humans and machines. Humans rely more on the general impression of an image, whereas the detector relies on fine details. The generation model also matters: Fooocus images are harder to detect than SDXL images. The study further examines detector limitations under downsampling, shows that human and machine attention align only partially, and discusses practical implications for real-world detection and prompt design. The results are framed around two hypotheses, $H_1$ and $H_2$, which formalize the expected impact of prompt detail and generation model quality on detectability. They point to the potential of combining human and machine cues to counter AI-generated disinformation. Future work is to extend the analyses to additional models and prompt characteristics.

Abstract

With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public by making disinformation easier to spread. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on fake detection performance has rarely been investigated. This work therefore examines the influence of the prompt's level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.
