Could AI Trace and Explain the Origins of AI-Generated Images and Text?

Hongchao Fang, Yixin Liu, Jiangshu Du, Can Qin, Ran Xu, Feng Liu, Lichao Sun, Dongwon Lee, Lifu Huang, Wenpeng Yin

TL;DR

The paper studies how to trace and explain the origins of AI-generated content across text and image modalities, and introduces the AI-FAKER dataset to enable fine-grained comparisons among AI-generated images, AI-generated text, and two malicious use cases. It benchmarks authorship attribution in four settings (diffusion-generated images, face-swapped images, AI-text-responding, and AI-paper-reviewing) and analyzes GPT-4o’s explanations of those attributions. Attribution success depends on both the output type and the source model's training objective: diffusion-generated images and AI-paper-reviews are easy to attribute, whereas face-swapped images and AI-text-responding are hard. GPT-4o’s explanations are highly consistent but less specific for content produced by OpenAI's own models, which points to limits in self-evaluation. The work also shows that format and length influence detection performance, and it discusses how attribution signals may change as models become more human-aligned, with implications for digital content integrity and robust attribution frameworks.

Abstract

AI-generated content is becoming increasingly prevalent in the real world, leading to serious ethical and societal concerns. For instance, adversaries might exploit large multimodal models (LMMs) to create images that violate ethical or legal standards, while paper reviewers may misuse large language models (LLMs) to generate reviews without genuine intellectual effort. While prior work has explored detecting AI-generated images and texts, and occasionally tracing their source models, there is a lack of a systematic and fine-grained comparative study. Important dimensions--such as AI-generated images vs. text, fully vs. partially AI-generated images, and general vs. malicious use cases--remain underexplored. Furthermore, whether AI systems like GPT-4o can explain why certain forged content is attributed to specific generative models is still an open question, with no existing benchmark addressing this. To fill this gap, we introduce AI-FAKER, a comprehensive multimodal dataset with over 280,000 samples spanning multiple LLMs and LMMs, covering both general and malicious use cases for AI-generated images and texts. Our experiments reveal two key findings: (i) AI authorship detection depends not only on the generated output but also on the model's original training intent; and (ii) GPT-4o provides highly consistent but less specific explanations when analyzing content produced by OpenAI's own models, such as DALL-E and GPT-4o itself.

Paper Structure

This paper contains 22 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Illustration of the four settings in AI-Faker.
  • Figure 2: Diffusion-generated images with the prompt: draw a picture with a man fishing.
  • Figure 3: Quality of GPT-4o's explanations of the origins of Diffusion-generated images and AI-paper-reviewing.
  • Figure 4: Length effects on AI-paper-reviewing.
  • Figure 5: Confusion matrices for tracing Diffusion-generated images (left) and AI-paper-reviewing (right). Note that diagonal values are set to 0 to highlight inter-model misclassification.
  • ...and 1 more figure
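
The Figure 5 caption says that diagonal values are set to 0 so that only inter-model misclassifications remain visible. A minimal Python sketch of that masking step follows. The model names and labels are hypothetical placeholders, not the paper's data, and the helper functions are illustrative rather than the authors' code.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count how often each true source model (row) is predicted as each model (column)."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def zero_diagonal(matrix):
    """Return a copy in which correct attributions (the diagonal) are set to 0."""
    return [
        [0 if row_idx == col_idx else value for col_idx, value in enumerate(row)]
        for row_idx, row in enumerate(matrix)
    ]


# Hypothetical source models and attribution results, for illustration only.
classes = ["model_a", "model_b", "model_c"]
true_labels = ["model_a", "model_a", "model_b", "model_b", "model_c", "model_c"]
predicted_labels = ["model_a", "model_b", "model_b", "model_c", "model_c", "model_a"]

full = confusion_matrix(true_labels, predicted_labels, classes)
off_diagonal = zero_diagonal(full)

for name, row in zip(classes, off_diagonal):
    print(name, row)
```

With the diagonal removed, each remaining nonzero cell shows which source model is mistaken for which other model, which is the pattern the caption describes highlighting.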