Detecting AI-Generated Images via CLIP

A. G. Moskowitz, T. Gaona, J. Peterson

TL;DR

The paper addresses detecting AI-generated images and identifying their generation method by fine-tuning a pretrained CLIP model on a diverse set of real and AI-generated images from multiple generators. By reframing detection as image-caption matching and training on a labeled caption set, the approach achieves strong accuracy, often surpassing specialized detectors while using fewer resources and no architectural changes. The results demonstrate the value of massive multimodal pretraining for robust, adaptable AIGI detection and suggest practical benefits for broad deployment and ongoing monitoring of AI-generated content.

Abstract

As AI-generated image (AIGI) methods become more powerful and accessible, determining whether an image is real or AI-generated has become a critical task. Because AIGI lack the signatures of photographs and have their own unique patterns, new models are needed to determine whether an image is AI-generated. In this paper, we investigate the ability of the Contrastive Language-Image Pre-training (CLIP) architecture, pre-trained on massive internet-scale datasets, to perform this differentiation. We fine-tune CLIP on real images and AIGI from several generative models, enabling CLIP to determine whether an image is AI-generated and, if so, which generation method was used to create it. We show that the fine-tuned CLIP architecture differentiates AIGI as well as or better than models whose architecture is specifically designed to detect AIGI. Our method will significantly increase access to AIGI-detecting tools and reduce the negative effects of AIGI on society, as our CLIP fine-tuning procedures require no architecture changes from publicly available model repositories and consume significantly fewer GPU resources than other AIGI detection models.
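
The image-caption matching framing mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes each class (real, or a specific generator) is represented by one caption, and that the image and captions have already been mapped to embedding vectors by CLIP's encoders. The caption strings, the three-dimensional toy embeddings, and the fixed temperature below are hypothetical placeholders, not the captions, embeddings, or settings used in the paper. CLIP itself learns its temperature, and fine-tuning would update the encoders that produce the embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_by_caption(image_embedding, caption_embeddings, temperature=0.01):
    """Pick the caption whose embedding best matches the image embedding.

    temperature is a fixed placeholder (an assumption for this sketch).
    """
    captions = list(caption_embeddings)
    logits = [
        cosine_similarity(image_embedding, caption_embeddings[c]) / temperature
        for c in captions
    ]
    probs = softmax(logits)
    best = max(range(len(captions)), key=lambda i: probs[i])
    return captions[best], dict(zip(captions, probs))


# Hypothetical captions with toy embeddings, one per class.
CAPTION_EMBEDDINGS = {
    "a real photograph": [1.0, 0.0, 0.0],
    "an image generated by generator A": [0.0, 1.0, 0.0],
    "an image generated by generator B": [0.0, 0.0, 1.0],
}

# Toy image embedding that lies closest to the "generator A" caption.
image_embedding = [0.1, 0.9, 0.2]

label, probs = classify_by_caption(image_embedding, CAPTION_EMBEDDINGS)
print(label)
```

Under this framing, a single prediction answers both questions at once: whether the image is AI-generated and, if so, which generator's caption it matches best. Adding a new generator amounts to adding a caption and fine-tuning data, with no change to the model architecture.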

Paper Structure

This paper contains 8 sections, 1 equation, 6 tables.