Vulnerabilities in AI-generated Image Detection: The Challenge of Adversarial Attacks

Yunfeng Diao, Naixin Zhai, Changtao Miao, Zitong Yu, Xingxing Wei, Xun Yang, Meng Wang

TL;DR

The paper investigates the adversarial robustness of AI-generated image detectors and introduces FPBA, a Frequency-based Post-train Bayesian Attack, to break diverse detectors. FPBA blends frequency-domain perturbations with a Bayesian-augmented surrogate to improve cross-architecture transferability, supplemented by a hybrid gradient combining spatial and frequency cues. Through extensive experiments on 17 detectors across multiple generators and defense scenarios, FPBA achieves near-perfect white-box success and strong black-box transferability, including cross-generator and compressed-image settings, while maintaining high visual quality. The findings reveal persistent vulnerabilities in AIGI detectors, challenge the effectiveness of current defenses, and highlight gradient-masking as an insufficient defense strategy, underscoring the need for robust, generalizable detection approaches.

Abstract

Recent advances in image synthesis, particularly the advent of GANs and diffusion models, have amplified public concerns about the dissemination of disinformation. To address these concerns, numerous AI-generated image (AIGI) detectors have been proposed and have achieved promising performance in identifying fake images. However, a systematic understanding of the adversarial robustness of AIGI detectors is still lacking. In this paper, we examine the vulnerability of state-of-the-art AIGI detectors to adversarial attacks under white-box and black-box settings, which has rarely been investigated so far. To this end, we propose a new method to attack AIGI detectors. First, inspired by the obvious difference between real and fake images in the frequency domain, we add perturbations in the frequency domain to push the image away from its original frequency distribution. Second, we explore the full posterior distribution of the surrogate model to further narrow the gap between heterogeneous AIGI detectors, e.g., transferring adversarial examples across CNNs and ViTs. This is achieved by introducing a novel post-train Bayesian strategy that turns a single surrogate into a Bayesian one, capable of simulating diverse victim models using one pre-trained surrogate, without the need for re-training. We name our method Frequency-based Post-train Bayesian Attack, or FPBA. Through FPBA, we demonstrate that adversarial attacks pose a real threat to AIGI detectors. FPBA can deliver successful black-box attacks across various detectors, generators, and defense methods, and can even evade cross-generator and compressed-image detection, which are crucial real-world detection scenarios. Our code is available at https://github.com/onotoa/fpba.
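
As a rough illustration of what "adding perturbations in the frequency domain" can mean, the sketch below transforms a small grayscale block with an orthonormal 2D DCT, shifts its coefficients, and transforms back. It is not the FPBA implementation: the function names (`dct2`, `idct2`, `perturb_in_frequency`) are hypothetical, and the perturbation `delta` is a hand-chosen placeholder, whereas the paper derives its perturbation from a surrogate detector.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2(block):
    """2D DCT of a square block: D X D^T."""
    d = dct_matrix(len(block))
    return matmul(matmul(d, block), transpose(d))


def idct2(coeffs):
    """Inverse 2D DCT: D^T C D (valid because D is orthonormal)."""
    d = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(d), coeffs), d)


def perturb_in_frequency(block, delta, lo=0.0, hi=1.0):
    """Add `delta` to the DCT coefficients of `block`, then return to pixels.

    `delta` is an assumed placeholder perturbation with the same shape as
    `block`; pixel values are clipped to [lo, hi] after the inverse DCT.
    """
    coeffs = dct2(block)
    shifted = [[c + e for c, e in zip(c_row, e_row)]
               for c_row, e_row in zip(coeffs, delta)]
    pixels = idct2(shifted)
    return [[min(hi, max(lo, p)) for p in row] for row in pixels]
```

With a zero `delta` the function returns the input block up to floating-point error, which is a convenient sanity check that the forward and inverse transforms match. Raising only the DC coefficient of an n x n block by some amount shifts every pixel by that amount divided by n.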

Paper Structure (36 sections, 11 equations, 11 figures, 20 tables, 1 algorithm)

Figures (11)

  • Figure 1: A high-level illustration of our proposed method.
  • Figure 2: The workflow of FPBA. We add spatial-frequency adversarial perturbations to AI-generated images in a Bayesian manner, so that they are misclassified as real. DCT and IDCT are the discrete cosine transformation and inverse discrete cosine transformation, respectively.
  • Figure 3: Visualization of the spectrum saliency map (averaged over 2000 images from the GenImage dataset) for real and fake images across different models. (a): results after applying the frequency spectrum transformation (N=10). (b)-(d): results for raw images on different models. The color value represents the absolute gradient value of the model loss function after max-min normalization.
  • Figure 4: The architecture of the appended model. $\sigma$ means the sigmoid layer.
  • Figure 5: Visualization of the sensitive frequency components of real and fake images (averaged over 1000 images from the LSUN/ProGAN datasets) for frequency-based models. The sensitive frequency components of frequency-based models are highly sparse compared with those of spatial-based models. The color value represents the absolute gradient value of the model loss function after max-min normalization.
  • ...and 6 more figures
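
The captions for Figures 3 and 5 describe color values as absolute loss gradients after max-min normalization, averaged over many images. A minimal sketch of that post-processing follows. The function name `spectrum_saliency` is hypothetical, the gradient maps are assumed to be already computed per image, and averaging before normalizing is an assumption, since the captions do not state the order.

```python
def spectrum_saliency(grad_maps):
    """Average |gradient| over images, then rescale the result to [0, 1].

    `grad_maps` is a list of equally sized 2D lists, one per image, holding
    the gradient of the loss with respect to each frequency component.
    """
    n = len(grad_maps)
    h, w = len(grad_maps[0]), len(grad_maps[0][0])

    # Mean absolute gradient per frequency component.
    mean_abs = [[sum(abs(g[i][j]) for g in grad_maps) / n for j in range(w)]
                for i in range(h)]

    # Max-min normalization; a constant map is mapped to all zeros.
    flat = [v for row in mean_abs for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0] * w for _ in range(h)]
    return [[(v - lo) / span for v in row] for row in mean_abs]
```

For example, two 2x2 gradient maps `[[1, -3], [0, 2]]` and `[[-1, 1], [0, -2]]` have mean absolute values `[[1, 2], [0, 2]]`, which normalize to `[[0.5, 1.0], [0.0, 1.0]]`.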