
Diversity Matters: Dataset Diversification and Dual-Branch Network for Generalized AI-Generated Image Detection

Nusrat Tasnim, Kutub Uddin, Khalid Malik

Abstract

The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present \textbf{Diversity Matters}, a novel framework that emphasizes data diversity and feature-domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. \textbf{Diversity Matters} highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.
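The abstract describes the filtering mechanism only at a high level. A minimal numpy sketch of one plausible reading is shown below: greedily keep a sample only if its cosine similarity to every already-kept sample stays below a threshold. The function name `diversity_filter`, the threshold `tau`, and the greedy ordering are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def diversity_filter(embeddings, tau=0.95):
    """Greedy similarity filtering over feature embeddings (e.g. CLIP).

    Keeps a sample only if its cosine similarity to every previously
    kept sample is below tau; near-duplicates are discarded.
    (Illustrative sketch: tau and the greedy pass order are assumptions.)
    """
    # L2-normalize so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(float(vec @ normed[j]) < tau for j in kept):
            kept.append(i)
    return kept
```

In this reading, the same routine could be applied within each class (intra-class) and across real/fake pools (inter-class) to prune dense clusters before training.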


Paper Structure

This paper contains 25 sections, 12 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: SoTA vs. diversity-aware dataset selection. (Left) Conventional approaches train detectors on all samples with high redundancy, resulting in dense clusters and overfitting. (Right) The proposed CLIP-based diversity-aware approach filters samples across both inter-class and intra-class distributions, retaining diverse samples in sparse clusters. Faded samples indicate removed similar pairs. This approach improves model generalization and robustness while significantly reducing the dataset size used to train the detectors.
  • Figure 2: Flow diagram of the proposed data diversification to improve generalizability across diverse generative models.
  • Figure 3: Architecture of the proposed diversity-aware, dual-branch AIGI detection framework. A diversity-aware selection strategy curates a representative training set. In the first branch (top), a frozen CLIP image encoder extracts embeddings from each patch of the raw pixel-level input. In the second branch, the CLIP image encoder is fine-tuned on the corresponding frequency-spectrum patches. Class tokens from all patches across both branches are then concatenated and passed through a trainable detection head, which predicts whether the input sample is real or fake.
  • Figure 4: Comparison of ROC curves for SoTA and the proposed AIGI detection models, with AUC and EER reported. Panels correspond to: (a) CNNDF/StyleGAN wang2020cnn, (b) GENI/ADM zhu2023genimage, and (c) GANDF/BEGAN tan2024rethinking. AUC indicates overall discrimination, whereas EER measures balanced false acceptance and false rejection rates.
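To make the Figure 3 description concrete, the sketch below shows one plausible form of the two branch inputs and their fusion: a log-magnitude 2D FFT as the frequency-branch input, and a concatenation of per-patch class tokens feeding the detection head. The helper names, the `log1p`/`fftshift` choices, and the flatten-then-concatenate fusion are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def frequency_spectrum(patch):
    """Log-magnitude 2D FFT of an image patch, a plausible input for
    the second (frequency) branch. fftshift centers low frequencies;
    log1p compresses the dynamic range. (Both are assumed details.)
    """
    spec = np.fft.fftshift(np.fft.fft2(patch))
    return np.log1p(np.abs(spec))

def fuse_class_tokens(pixel_tokens, freq_tokens):
    """Concatenate per-patch class tokens from both branches into one
    feature vector for the trainable detection head."""
    return np.concatenate([np.ravel(pixel_tokens), np.ravel(freq_tokens)])
```

Under this reading, the pixel-branch tokens come from the frozen CLIP encoder and the frequency-branch tokens from the fine-tuned encoder applied to `frequency_spectrum(patch)`; the fused vector is then classified real vs. fake.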