Rethinking the Use of Vision Transformers for AI-Generated Image Detection

NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim

TL;DR

The paper interrogates the assumption that last-layer CLIP-ViT features are optimal for AI-generated image detection. By conducting a detailed layer-wise analysis, it shows that mid-layer features carry strong discriminative power and that different layers encode complementary information. It then introduces MoLD, a gating-based fusion of CLS-token representations across all ViT layers, to adaptively weight and combine multi-layer features. Across GAN- and diffusion-based generators, MoLD achieves superior detection performance and generalizes well to other pre-trained ViTs, demonstrating the value of fully leveraging multi-layer ViT representations for robust forgery detection.

Abstract

Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.
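The abstract describes a gating-based mechanism that integrates features from multiple ViT layers, but does not give its exact form here. The following is a minimal sketch of one plausible reading, not the authors' implementation: each layer's [CLS] embedding is scored by its own linear scorer, a softmax gate weights the layers, and the gated sum passes through a sigmoid. The names `gated_fusion`, `layer_weights`, and `gate_weights`, the linear scorers, and the softmax gate are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(scores):
    # Subtract the max before exponentiating to avoid overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def gated_fusion(cls_tokens, layer_weights, gate_weights):
    """Fuse per-layer [CLS] embeddings into one fake-image probability.

    cls_tokens:    one embedding (list of floats) per ViT layer
    layer_weights: hypothetical linear scorer per layer (assumption)
    gate_weights:  hypothetical linear gate per layer (assumption)
    Returns (probability, gates).
    """
    # Layer-wise logits from the per-layer scorers.
    layer_logits = [dot(w, h) for w, h in zip(layer_weights, cls_tokens)]
    # Input-dependent gate: one score per layer, normalised by softmax.
    gates = softmax([dot(v, h) for v, h in zip(gate_weights, cls_tokens)])
    # The gated sum of layer logits is the fused logit.
    fused = sum(g * z for g, z in zip(gates, layer_logits))
    return sigmoid(fused), gates


# Toy usage: 4 layers, 8-dimensional embeddings, random parameters.
random.seed(0)
num_layers, dim = 4, 8
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_layers)]
scorers = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_layers)]
gate_params = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_layers)]
prob, gates = gated_fusion(tokens, scorers, gate_params)
```

In this sketch the gates depend on the input, so different images can draw on different layers. In a real setup the scorers and gate would be trained with binary cross-entropy on top of a frozen backbone.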

Paper Structure

This paper contains 26 sections, 7 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Layer-wise performance in AI-generated image detection. We train individual classifiers on features from different layers of a pre-trained CLIP:ViT-L/14 model, using the GenImage-ADM training set. Each layer-specific classifier is then evaluated on various test subsets, and the average precision is reported. The results indicate that the best-performing layer (marked with a star for each test subset) varies across datasets, with mid-layer features generally outperforming those from earlier or later layers.
  • Figure 2: Overlap in misclassified samples across layers. To analyze the decision-making process of classifiers trained on features from different layers, we measure the proportion of commonly misclassified samples. The results indicate that the overlap in misclassified samples across layers is relatively low, suggesting that each layer captures distinct aspects of the input data. This highlights the complementary nature of multi-layer representations and underscores the importance of integrating features from all layers for robust fake image detection.
  • Figure 3: Overview of our MoLD. Our approach fully leverages the features of a pre-trained Vision Transformer by aggregating the [CLS] token embeddings from all transformer layers. A dedicated lightweight network is applied at each layer to generate layer-wise predictions. These predictions are then processed by a learnable classification head to produce the final prediction. The layer-wise networks $g_i$ and the classification head are jointly trained using binary cross-entropy loss. Note that the ViT backbone remains frozen throughout the entire training process.
  • Figure 4: Perturbations
  • Figure 5: Train-data size
  • ...and 10 more figures
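
The overlap analysis described in the Figure 2 caption can be illustrated with a small sketch. The caption says only that the proportion of commonly misclassified samples is measured. The sketch assumes a Jaccard-style ratio (shared errors divided by the union of errors) between each pair of layer-specific classifiers, which may differ from the paper's exact definition. The function names and toy predictions are hypothetical.

```python
def misclassified(preds, labels):
    # Indices of samples where the prediction disagrees with the label.
    return {i for i, (p, y) in enumerate(zip(preds, labels)) if p != y}


def overlap_matrix(layer_preds, labels):
    """Pairwise overlap of misclassified samples between layer classifiers.

    layer_preds: one list of predicted labels per layer-specific classifier
    labels:      ground-truth labels (0 = real, 1 = fake)
    Entry [a][b] is |errors_a & errors_b| / |errors_a | errors_b|
    (an assumed Jaccard-style definition), or 0.0 if neither layer errs.
    """
    errors = [misclassified(p, labels) for p in layer_preds]
    n = len(errors)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            union = errors[a] | errors[b]
            if union:
                matrix[a][b] = len(errors[a] & errors[b]) / len(union)
    return matrix


# Toy usage: three layer classifiers, six samples (made-up predictions).
labels = [0, 1, 1, 0, 1, 0]
layer_preds = [
    [0, 1, 0, 0, 1, 1],  # errs on samples 2 and 5
    [0, 0, 1, 0, 1, 1],  # errs on samples 1 and 5
    [1, 1, 1, 0, 1, 0],  # errs on sample 0
]
matrix = overlap_matrix(layer_preds, labels)
```

Low off-diagonal values in such a matrix mean the classifiers fail on different samples, which is the pattern the caption reports as evidence that the layers are complementary.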