Trinity Detector: text-assisted and attention mechanisms based spectral fusion for diffusion generation image detection
Jiawei Song, Dengpan Ye, Yunming Zhang
TL;DR
The paper tackles the challenge of detecting images generated by diffusion models, which often evade traditional forgery detectors. It introduces Trinity Detector, a multimodal framework that combines a Multispectral Channel Attention Fusion Unit for adaptive frequency-band fusion with CLIP-based text-image alignment to capture spectral and semantic cues. A new TxtDiffusionForensics dataset of diffusion-generated image-text pairs is released to benchmark detectors across Stable Diffusion and GLIDE. Empirical results demonstrate strong generalization and robustness to unseen diffusion models, highlighting significant improvements in transferability and detection reliability for diffusion-based forgeries.
Abstract
Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have enabled the malicious use of deep forgeries, raising concerns about the trustworthiness of multimedia content. Adapting traditional forgery detection methods to diffusion models has proven challenging. This paper therefore proposes Trinity Detector, a forgery detection method designed explicitly for diffusion models. Trinity Detector incorporates coarse-grained text features through a CLIP encoder and integrates them coherently with fine-grained artifacts in the pixel domain for comprehensive multimodal detection. To heighten sensitivity to diffusion-generated image features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed; it extracts spectral inconsistencies through adaptive fusion of diverse frequency bands and further integrates the spatial co-occurrence of the two modalities. Extensive experiments show that Trinity Detector outperforms several state-of-the-art methods: its performance is competitive across all datasets, with up to 17.6% improvement in transferability on the diffusion datasets.
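
The idea of adaptively fusing frequency bands can be illustrated with a minimal NumPy sketch. This is not the paper's MCAF: the abstract does not specify its architecture, so the radial FFT band split, the function names (`radial_band_masks`, `split_bands`, `attention_fuse`), and the energy-based scores below are all assumptions made for illustration. In a trained detector, the per-band scores would presumably come from learned attention layers rather than a fixed heuristic.

```python
import numpy as np

def radial_band_masks(h, w, n_bands):
    """Boolean masks that partition the centered 2-D spectrum into concentric bands."""
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    edges = np.linspace(0.0, radius.max(), n_bands + 1)
    masks = []
    for i in range(n_bands):
        # The last band is closed on the right so the highest frequency is kept.
        if i == n_bands - 1:
            upper = radius <= edges[i + 1]
        else:
            upper = radius < edges[i + 1]
        masks.append((radius >= edges[i]) & upper)
    return masks

def split_bands(image, n_bands=3):
    """Decompose a grayscale image into n_bands band-limited images (hypothetical helper)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    masks = radial_band_masks(image.shape[0], image.shape[1], n_bands)
    bands = [np.fft.ifft2(np.fft.ifftshift(spectrum * m)).real for m in masks]
    return np.stack(bands)  # shape: (n_bands, H, W)

def attention_fuse(bands, scores=None):
    """Fuse bands with softmax weights.

    Assumption: if no scores are given, the log of each band's mean energy
    stands in for a learned attention score.
    """
    if scores is None:
        scores = np.log(np.mean(bands ** 2, axis=(1, 2)) + 1e-8)
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    fused = np.tensordot(weights, bands, axes=1)  # weighted sum over the band axis
    return fused, weights

# Usage on a random stand-in for a grayscale image.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
bands = split_bands(image, n_bands=3)
fused, weights = attention_fuse(bands)
```

Because the masks partition the spectrum, the bands sum back to the original image, so the attention weights only re-balance how much each frequency range contributes to the fused map.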
