Robust AI-Synthesized Image Detection via Multi-feature Frequency-aware Learning

Hongfei Cai, Chi Liu, Sheng Shen, Youyang Qu, Peng Gui

TL;DR

The paper tackles the problem of robustly detecting AI-synthesized images under cross-model generalization and real-world perturbations. It proposes a multi-feature fusion framework that combines NPR, image gradient features, and CLIP-based semantic priors through cross-source attention, followed by a frequency-aware residual backbone using Frequency-Adaptive Dilated Convolution to jointly model spatial and spectral cues. Extensive experiments across fourteen GenAI models demonstrate superior generalization to unseen models and resilience to common post-processing operations, with ablations confirming the synergistic benefits of feature fusion and frequency-aware learning. The approach delivers practical impact by offering a robust detector capable of handling diverse forgery types and transmission noise, contributing to safer deployment of AI-generated content detection systems.

Abstract

The rapid progression of generative AI (GenAI) technologies has heightened concerns regarding the misuse of AI-generated imagery. To address this issue, robust detection methods have emerged as particularly compelling, especially in challenging conditions where the targeted GenAI models are out-of-distribution or the generated images have been subjected to perturbations during transmission. This paper introduces a multi-feature fusion framework that enhances spatial forensic feature representations by incorporating three complementary components, namely noise correlation analysis, image gradient information, and pretrained vision encoder knowledge, using a cross-source attention mechanism. Furthermore, to identify spectral abnormalities in synthetic images, we propose a frequency-aware architecture that employs the Frequency-Adaptive Dilated Convolution, enabling the joint modeling of spatial and spectral features while maintaining low computational complexity. Our framework exhibits exceptional generalization performance across fourteen diverse GenAI systems, including text-to-image diffusion models, autoregressive approaches, and post-processed deepfake methods. Notably, it achieves significantly higher mean accuracy in cross-model detection tasks than existing state-of-the-art techniques. Additionally, the proposed method demonstrates resilience against various types of real-world image perturbations such as compression and blurring. Extensive ablation studies further corroborate the synergistic benefits of fusing multi-model forensic features with frequency-aware learning, underscoring the efficacy of our approach.
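
The cross-source attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head scaled dot-product form, the residual connection, the token counts, the feature width, and the function name `cross_source_attention` are all assumptions made for illustration, and random matrices stand in for weights that would normally be learned. It only shows the general idea of letting tokens from one feature source (here, the noise branch) attend to tokens from the other two sources.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_source_attention(query_src, context_srcs, rng, d_k=32):
    """Hypothetical single-head cross-source attention.

    query_src:    (n_q, d) tokens from one feature source.
    context_srcs: list of (n_i, d) token arrays from the other sources.
    Returns the fused (n_q, d) tokens and the (n_q, sum n_i) attention weights.
    """
    d = query_src.shape[1]
    context = np.concatenate(context_srcs, axis=0)

    # Random projections stand in for learned weight matrices (assumption).
    w_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)

    q = query_src @ w_q
    k = context @ w_k
    v = context @ w_v

    # Each query token distributes its attention over all context tokens.
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)

    # Residual connection keeps the original query-source information.
    fused = query_src + weights @ v
    return fused, weights


rng = np.random.default_rng(0)
noise_tokens = rng.standard_normal((16, 64))     # stand-in for noise features
gradient_tokens = rng.standard_normal((16, 64))  # stand-in for gradient features
semantic_tokens = rng.standard_normal((4, 64))   # stand-in for encoder features

fused, weights = cross_source_attention(
    noise_tokens, [gradient_tokens, semantic_tokens], rng
)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to one, so every fused token is its original value plus a convex combination of projected context tokens from the other two sources.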

Paper Structure

This paper contains 25 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The architecture of the proposed multi-feature frequency-aware learning model. The multi-branch network integrates CLIP semantic features, transformation gradients, and NPR (noise pattern residual) features. Frequency decomposition via Discrete Wavelet Transform separates low- and high-frequency components, followed by residual-enhanced feature refinement and attention-based feature weighting. Final classification is achieved through hierarchical convolutional blocks and fully connected layers.
  • Figure 2: t-SNE visualization of 2000 test images, showing how the frequency domain module improves the separation between real (blue) and fake (red) images. The left panel shows the feature distribution without the frequency domain module, while the right panel shows the improved separation with it.
  • Figure 3: Robustness of different detection methods to image compression and noise. Average AP scores of the baseline model of Wang et al. (CVPR 2020) and ours are compared across two types of generative models (GAN and Diffusion) under different JPEG compression qualities (top row) and different levels of Gaussian noise (bottom row).
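
The Figure 1 caption describes a Discrete Wavelet Transform that separates low- and high-frequency components. The following sketch shows a single-level 2D DWT with the Haar wavelet in plain NumPy. The choice of the Haar wavelet, the single decomposition level, and the helper name `haar_dwt2` are assumptions for illustration; the caption does not state which wavelet or how many levels the paper uses.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level orthonormal 2D Haar DWT of an (H, W) array.

    H and W must be even. Returns the low-frequency approximation band
    and a tuple of three high-frequency detail bands, each (H/2, W/2).
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")

    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right

    low = (a + b + c + d) / 2.0            # local average (low frequency)
    detail_rows = (a + b - c - d) / 2.0    # difference between the two rows
    detail_cols = (a - b + c - d) / 2.0    # difference between the two columns
    detail_diag = (a - b - c + d) / 2.0    # diagonal difference
    return low, (detail_rows, detail_cols, detail_diag)


rng = np.random.default_rng(0)
img = rng.random((8, 8))
low, (d_rows, d_cols, d_diag) = haar_dwt2(img)

# The transform is orthonormal, so total energy is preserved.
energy_in = np.sum(img ** 2)
energy_out = sum(np.sum(band ** 2) for band in (low, d_rows, d_cols, d_diag))
print(low.shape, np.isclose(energy_in, energy_out))
```

For a constant image all three detail bands are zero, so any energy in them reflects local variation such as edges, texture, or high-frequency artifacts.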