RIGID: A Training-free and Model-Agnostic Framework for Robust AI-Generated Image Detection

Zhiyuan He, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

This work tackles AI-generated image detection without training detectors. It introduces RIGID, a training-free and model-agnostic framework that probes the sensitivity of real versus generated images by perturbing inputs and measuring cosine similarity in a pretrained feature space, specifically $sim = \cos(f(x), f(x+\lambda\delta))$. The authors provide a theoretical justification via Stein's lemma, showing generated images exhibit larger gradient norms in the smoothed similarity metric. Extensive experiments across ImageNet, LSUN-Bedroom, and GenImage demonstrate that RIGID outperforms both training-based and training-free baselines, generalizes across generation methods, and remains robust to common image corruptions, making it a cost-efficient and practical solution for robust AI-generated image detection.
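
The link between the perturbation score and gradient norms can be made concrete with the standard Gaussian form of Stein's lemma. The block below is a sketch, not the paper's own theorem statement: $g$ stands for any scalar similarity function of the image (an assumption introduced here for illustration), $\delta$ is standard Gaussian noise, and $\hat{g}$ is the noise-smoothed version of $g$.

```latex
% Similarity score from the TL;DR, with Gaussian noise assumed for \delta:
\[
  \mathrm{sim}(x) = \cos\bigl(f(x),\, f(x + \lambda\delta)\bigr),
  \qquad \delta \sim \mathcal{N}(0, I).
\]
% Gaussian smoothing of a generic scalar function g (g is illustrative):
\[
  \hat{g}(x) = \mathbb{E}_{\delta}\bigl[\, g(x + \lambda\delta) \,\bigr].
\]
% Stein's lemma, E[\delta\, h(\delta)] = E[\nabla_{\delta} h(\delta)],
% applied to h(\delta) = g(x + \lambda\delta), whose gradient in \delta
% is \lambda \nabla g(x + \lambda\delta):
\[
  \nabla \hat{g}(x)
  = \mathbb{E}_{\delta}\bigl[\, \nabla g(x + \lambda\delta) \,\bigr]
  = \frac{1}{\lambda}\,
    \mathbb{E}_{\delta}\bigl[\, \delta\, g(x + \lambda\delta) \,\bigr].
\]
```

Read this way, how strongly the similarity responds to random perturbations is tied to the gradient of the smoothed metric without backpropagating through $f$, which is consistent with the claim that generated images show larger gradient norms.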

Abstract

The rapid advances in generative AI models have empowered the creation of highly realistic images with arbitrary content, raising concerns about potential misuse and harm, such as Deepfakes. Current research focuses on training detectors using large datasets of generated images. However, these training-based solutions are often computationally expensive and show limited generalization to unseen generated images. In this paper, we propose a training-free method to distinguish between real and AI-generated images. We first observe that real images are more robust to tiny noise perturbations than AI-generated images in the representation space of vision foundation models. Based on this observation, we propose RIGID, a training-free and model-agnostic method for robust AI-generated image detection. RIGID is a simple yet effective approach that identifies whether an image is AI-generated by comparing the representation similarity between the original and the noise-perturbed counterpart. Our evaluation on a diverse set of AI-generated images and benchmarks shows that RIGID significantly outperforms existing training-based and training-free detectors. In particular, the average performance of RIGID exceeds the current best training-free method by more than 25%. Importantly, RIGID exhibits strong generalization across different image generation methods and robustness to image corruptions.
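
The detection rule described above (perturb the image, compare representations, flag low similarity) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `make_toy_extractor` is a hypothetical stand-in for a pretrained backbone such as DINOv2, and the function names, the `n_samples` averaging, and the threshold value are all introduced here for illustration.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two feature vectors (flattened)."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def rigid_style_score(feature_fn, image, noise_level=0.05, n_samples=1, seed=0):
    """Similarity between features of an image and its noisy copy.

    Computes cos(f(x), f(x + noise_level * delta)) with Gaussian delta.
    Averaging over n_samples draws is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    clean_features = feature_fn(image)
    sims = []
    for _ in range(n_samples):
        delta = rng.standard_normal(image.shape)
        noisy_features = feature_fn(image + noise_level * delta)
        sims.append(cosine_similarity(clean_features, noisy_features))
    return float(np.mean(sims))


def flag_as_generated(score, threshold):
    """Lower similarity means less robust to noise, so flag as generated.

    The threshold is hypothetical and would need calibrating on real data.
    """
    return bool(score < threshold)


def make_toy_extractor(input_dim, feature_dim=64, seed=0):
    """Hypothetical stand-in for a pretrained backbone: random projection + tanh."""
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((feature_dim, input_dim)) / np.sqrt(input_dim)

    def extract(x):
        return np.tanh(weights @ np.ravel(x))

    return extract


if __name__ == "__main__":
    image = np.random.default_rng(1).random((3, 32, 32))  # placeholder "image"
    extractor = make_toy_extractor(image.size)
    score = rigid_style_score(extractor, image, noise_level=0.05, n_samples=4)
    print(score, flag_as_generated(score, threshold=0.95))
```

Because the score only needs forward passes through a frozen extractor, swapping in a different backbone means replacing `feature_fn` and nothing else, which is the sense in which the approach is training-free and model-agnostic.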

Paper Structure (25 sections, 4 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Overview of RIGID. Upper left: visualization of the attention range of different models for real images and AI-generated (fake) images by GradCAM \cite{gradcam}. CLIP and DINOv2 attend better to global context than ResNet-50. Upper right: visualization of the cosine similarity landscape for real and AI-generated images, obtained by plotting the interpolation of two random directions in the image pixel space with coefficients $\alpha$ and $\beta$, following \cite{landscape}. We find that on DINOv2, real and AI-generated images exhibit distinct sensitivity results. See details of how to plot the landscape in Appendix \ref{ap:landscape}. Bottom: the framework of RIGID. RIGID uses a pretrained feature extractor to compute the pairwise cosine similarity on the original and noise-perturbed images for AI-generated image detection. The entire detection process is training-free, model-agnostic, and efficient. See Sec. \ref{sec:rigid} for details.
  • Figure 2: The average precision of various AI-generated image detectors on images generated by popular platforms (Wukong, SD1.4, SD1.5, and Midjourney).
  • Figure 3: Cross-dataset Evaluation on ImageNet and LSUN-Bedroom. The violin graph shows AP distribution, where the black bar in the center indicates the interquartile range and the white dot is the median.
  • Figure 4: Robustness to Image Corruptions. The top row shows the robustness to Gaussian noise ($\lambda$ represents the noise intensity). The second row shows the robustness to JPEG compression, and the bottom row shows the robustness to Gaussian blur.
  • Figure 5: Detection performance for different noise intensities (the value $\lambda$ in Eq. \ref{eq:rigid}). The left y-axis shows AP and the right y-axis shows cosine similarity.
  • ...and 5 more figures