Mixture of Low-rank Experts for Transferable AI-Generated Image Detection

Zihan Liu, Hanyi Wang, Yaoyu Kang, Shilin Wang

TL;DR

The paper tackles universal AI-generated image detection, with a focus on generalizing to generators unseen during training. It proposes a parameter-efficient approach that keeps the pre-trained CLIP-ViT weights frozen and adapts only the MLP layers of the deepest blocks, using a mixture of shared and separate low-rank adapters (LoRAs) in a Mixture-of-Experts structure where a router assigns each token to one separate expert. This yields state-of-the-art cross-generator generalization on UnivFD and GenImage. The best ViT-L/14 variant trains only 0.08% of its parameters yet surpasses the leading baseline by +3.64% mAP and +12.72% average accuracy on unseen diffusion and autoregressive models, and it still outperforms that baseline when given just 0.28% of the training data. The method is also robust to post-processing, and the results highlight the importance of pre-trained vision-language backbones and adapter design for transferable detection, suggesting a promising direction for scalable, interpretable AI forensics.

Abstract

Generative models have shown a giant leap in synthesizing photo-realistic images with minimal expertise, sparking concerns about the authenticity of online information. This study aims to develop a universal AI-generated image detector capable of identifying images from diverse sources. Existing methods struggle to generalize across unseen generative models when provided with limited sample sources. Inspired by the zero-shot transferability of pre-trained vision-language models, we seek to harness the nontrivial visual-world knowledge and descriptive proficiency of CLIP-ViT to generalize over unknown domains. This paper presents a novel parameter-efficient fine-tuning approach, mixture of low-rank experts, to fully exploit CLIP-ViT's potential while preserving knowledge and expanding capacity for transferable detection. We adapt only the MLP layers of deeper ViT blocks via an integration of shared and separate LoRAs within an MoE-based structure. Extensive experiments on public benchmarks show that our method achieves superiority over state-of-the-art approaches in cross-generator generalization and robustness to perturbations. Remarkably, our best-performing ViT-L/14 variant requires training only 0.08% of its parameters to surpass the leading baseline by +3.64% mAP and +12.72% avg.Acc across unseen diffusion and autoregressive models. This even outperforms the baseline with just 0.28% of the training data. Our code and pre-trained models will be available at https://github.com/zhliuworks/CLIPMoLE.
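
The adapter structure described in the abstract (a frozen MLP weight, one shared LoRA, and several separate LoRAs selected per token by a router) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `MoLELinear`, the layer sizes, the ranks, the initialization, and the choice to scale the selected expert by its softmax gate value are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoLELinear:
    """Hypothetical sketch: frozen linear layer + shared LoRA + routed LoRAs."""

    def __init__(self, d_in, d_out, shared_rank=8, expert_ranks=(2, 4, 8), seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a frozen pre-trained MLP weight (never updated).
        self.W = rng.normal(0.0, 0.02, (d_in, d_out))
        # Shared LoRA applied to every token.
        self.shared = self._lora(rng, d_in, d_out, shared_rank)
        # Separate LoRA experts of different ranks (ranks are assumed values).
        self.experts = [self._lora(rng, d_in, d_out, r) for r in expert_ranks]
        # Router: one logit per separate expert for each token.
        self.router = rng.normal(0.0, 0.02, (d_in, len(expert_ranks)))

    @staticmethod
    def _lora(rng, d_in, d_out, rank):
        # Standard LoRA init: A random, B zero, so the update starts at zero.
        return rng.normal(0.0, 0.02, (d_in, rank)), np.zeros((rank, d_out))

    def forward(self, x):
        """x has shape (tokens, d_in); returns (output, chosen expert per token)."""
        out = x @ self.W
        A, B = self.shared
        out = out + (x @ A) @ B
        gates = softmax(x @ self.router)      # (tokens, n_experts)
        choice = gates.argmax(axis=-1)        # top-1 routing
        for k, (A, B) in enumerate(self.experts):
            mask = choice == k
            if mask.any():
                # Assumption: scale the chosen expert by its gate probability.
                out[mask] += gates[mask, k:k + 1] * ((x[mask] @ A) @ B)
        return out, choice

    def trainable_params(self):
        loras = [self.shared] + self.experts
        return sum(A.size + B.size for A, B in loras) + self.router.size

layer = MoLELinear(d_in=16, d_out=32)
tokens = np.random.default_rng(1).normal(size=(5, 16))
output, chosen = layer.forward(tokens)
print(output.shape, chosen)
```

In this sketch only the LoRA matrices and the router would receive gradients; `W` plays the role of the frozen pre-trained weight, which is how the approach keeps the trainable fraction small.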

Paper Structure

This paper contains 15 sections, 3 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Fine-tuning methods comparison. Compared to (b) full fine-tuning [20wang] and (c) linear probing [23universal], our method allows for more effective and efficient adaptation of CLIP-ViT for this task.
  • Figure 2: Average FFT spectra of the high-pass filtered images. The first one represents real images from ImageNet [09imagenet]. The last 11 spectra correspond to distinct classes of fake images.
  • Figure 3: Overview of our proposed mixture of low-rank experts for AI-generated image detection. For the last three blocks of CLIP ViT-B/32, we introduce an integration of shared and separate LoRAs of different ranks as adaptable low-rank experts. The router is responsible for assigning each token to one separate LoRA expert in the MLP layer of each block. During fine-tuning, only the LoRA experts, the routers, and the MLP head are optimized.
  • Figure 4: Robustness to post-processing operations. We assess the resilience of the detectors against two test-time perturbations, i.e., (a) Gaussian blur and (b) JPEG compression, using the UnivFD dataset. We compare our method to CNNSpot and UnivFD, all trained under identical settings. We show the mAP results averaged across 19 generators and three types of generative models.
  • Figure 5: Effect of pre-trained backbone. The pre-trained ViT of CLIP substantially enhances generalization across nearly all generators when compared to that pre-trained on ImageNet classification. The red dotted line indicates chance performance.
  • ...and 3 more figures
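
The analysis behind Figure 2 (an average FFT spectrum over high-pass filtered images) can be approximated with the short NumPy sketch below. The caption does not say which high-pass filter is used, so this sketch assumes a simple one: the image minus a 3x3 box-blurred copy. The function names, the log scaling, and the grayscale input are likewise illustrative assumptions rather than details from the paper.

```python
import numpy as np

def box_blur(img, k=3):
    """Mean filter with edge padding; img is a 2-D float array."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def high_pass(img, k=3):
    """Assumed high-pass filter: residual after removing the local mean."""
    return img - box_blur(img, k)

def average_log_spectrum(images, k=3):
    """Average log-magnitude FFT spectrum of the high-pass residuals."""
    spectra = []
    for img in images:
        residual = high_pass(np.asarray(img, dtype=float), k)
        # Shift the zero-frequency component to the centre for display.
        freq = np.fft.fftshift(np.fft.fft2(residual))
        spectra.append(np.log1p(np.abs(freq)))
    return np.mean(spectra, axis=0)

# Random arrays stand in for same-sized grayscale images from one source.
rng = np.random.default_rng(0)
fake_batch = [rng.random((32, 32)) for _ in range(4)]
spectrum = average_log_spectrum(fake_batch)
print(spectrum.shape)
```

Computing one such averaged spectrum per image source (real images and each generator class) gives the kind of side-by-side comparison the figure describes.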