Robust AI-Synthesized Image Detection via Multi-feature Frequency-aware Learning
Hongfei Cai, Chi Liu, Sheng Shen, Youyang Qu, Peng Gui
TL;DR
The paper tackles the problem of robustly detecting AI-synthesized images under cross-model generalization and real-world perturbations. It proposes a multi-feature fusion framework that combines NPR, image gradient features, and CLIP-based semantic priors through cross-source attention, followed by a frequency-aware residual backbone using Frequency-Adaptive Dilated Convolution to jointly model spatial and spectral cues. Extensive experiments across fourteen GenAI models demonstrate superior generalization to unseen models and resilience to common post-processing operations, with ablations confirming the synergistic benefits of feature fusion and frequency-aware learning. The approach delivers practical impact by offering a robust detector capable of handling diverse forgery types and transmission noise, contributing to safer deployment of AI-generated content detection systems.
Abstract
The rapid progression of generative AI (GenAI) technologies has heightened concerns regarding the misuse of AI-generated imagery. To address this issue, robust detection methods have become particularly important, especially in challenging conditions where the targeted GenAI models are out-of-distribution or the generated images have been subjected to perturbations during transmission. This paper introduces a multi-feature fusion framework designed to enhance spatial forensic feature representations by incorporating three complementary components, namely noise correlation analysis, image gradient information, and pretrained vision encoder knowledge, using a cross-source attention mechanism. Furthermore, to identify spectral abnormalities in synthetic images, we propose a frequency-aware architecture that employs Frequency-Adaptive Dilated Convolution, enabling the joint modeling of spatial and spectral features while maintaining low computational complexity. Our framework exhibits exceptional generalization performance across fourteen diverse GenAI systems, including text-to-image diffusion models, autoregressive approaches, and post-processed deepfake methods. Notably, it achieves significantly higher mean accuracy in cross-model detection tasks than existing state-of-the-art techniques. Additionally, the proposed method demonstrates resilience against various types of real-world image perturbations such as compression and blurring. Extensive ablation studies further corroborate the synergistic benefits of fusing multiple forensic features with frequency-aware learning, underscoring the efficacy of our approach.
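To make the cross-source attention idea concrete, the following is a minimal sketch of generic single-head cross-attention in NumPy, in which tokens from one feature source attend to tokens from the other two. It is not the paper's implementation: the abstract does not specify the attention design, so the choice of which source supplies the queries, the token counts, the dimensions, the random projection matrices, and all names (`cross_source_attention`, `npr_tokens`, `grad_tokens`, `clip_tokens`) are assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_source_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Generic single-head cross-attention (illustrative, not the paper's design).

    query_feats:   (n_q, d) tokens from the source that asks the questions
    context_feats: (n_c, d) tokens from the other sources
    w_q, w_k, w_v: (d, d_k) projection matrices (random here, learned in practice)
    Returns the fused tokens (n_q, d_k) and the attention weights (n_q, n_c).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarity
    weights = softmax(scores, axis=-1)       # each query row sums to 1
    return weights @ v, weights


# Hypothetical stand-ins for the three feature sources named in the abstract.
rng = np.random.default_rng(0)
d, d_k = 16, 8
npr_tokens = rng.normal(size=(4, d))    # noise-correlation features (assumed shape)
grad_tokens = rng.normal(size=(4, d))   # image-gradient features (assumed shape)
clip_tokens = rng.normal(size=(2, d))   # pretrained-encoder features (assumed shape)

# Assumption: the noise-correlation tokens query the other two sources.
context = np.concatenate([grad_tokens, clip_tokens], axis=0)
w_q, w_k, w_v = (rng.normal(size=(d, d_k)) * 0.1 for _ in range(3))

fused, weights = cross_source_attention(npr_tokens, context, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch each query token receives a convex combination of the projected context tokens, which is the basic mechanism by which one feature source can be enriched with information from the others; a trained model would learn the projections and typically use multiple heads.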
