Trinity Detector: text-assisted and attention mechanisms based spectral fusion for diffusion generation image detection

Jiawei Song, Dengpan Ye, Yunming Zhang

TL;DR

The paper tackles the challenge of detecting images generated by diffusion models, which often evade traditional forgery detectors. It introduces Trinity Detector, a multimodal framework that combines a Multispectral Channel Attention Fusion Unit for adaptive frequency-band fusion with CLIP-based text-image alignment to capture spectral and semantic cues. A new TxtDiffusionForensics dataset of diffusion-generated image-text pairs is released to benchmark detectors across Stable Diffusion and GLIDE. Empirical results demonstrate strong generalization and robustness to unseen diffusion models, highlighting significant improvements in transferability and detection reliability for diffusion-based forgeries.

Abstract

Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have led to the malicious use of deep forgeries, raising concerns about the trustworthiness of multimedia content. Adapting traditional forgery detection methods to diffusion models proves challenging. Thus, this paper proposes a forgery detection method explicitly designed for diffusion models called Trinity Detector. Trinity Detector incorporates coarse-grained text features through a CLIP encoder, coherently integrating them with fine-grained artifacts in the pixel domain for comprehensive multimodal detection. To heighten sensitivity to diffusion-generated image features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed, extracting spectral inconsistencies through adaptive fusion of diverse frequency bands and further integrating spatial co-occurrence of the two modalities. Extensive experiments show that Trinity Detector outperforms several state-of-the-art methods: its performance is competitive across all datasets, with up to a 17.6% improvement in transferability on the diffusion datasets.

Paper Structure

This paper contains 16 sections, 10 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Diffusion generation process and comparison of the spectrograms of real and fake images after DCT transformation.
  • Figure 2: An illustration of diffusion-generated image detection. The Text-Image Alignment and Extraction Module processes text-image pairs and extracts aligned content information. Within the SpectraFuse Unit, DCT vectors chosen by the DCT Selection Criterion are used to apply a DCT to each channel of the image. A channel attention mechanism then fuses the frequency-domain information.
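
The Figure 2 caption describes a per-channel DCT followed by channel attention over the resulting frequency information. The NumPy sketch below illustrates that general idea only; it is not the authors' implementation. The function names, the hand-picked `bands` list (a stand-in for the DCT Selection Criterion, which this page does not specify), the squeeze-and-excitation style gate, and the random weights `w1` and `w2` are all assumptions made for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k holds frequency k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)  # rescale the DC row so all rows have unit norm
    return basis * np.sqrt(2.0 / n)


def dct2(channel: np.ndarray) -> np.ndarray:
    """2D DCT-II of one (H, W) channel, applied separably along rows and columns."""
    h, w = channel.shape
    return dct_matrix(h) @ channel @ dct_matrix(w).T


def band_descriptors(image: np.ndarray, bands) -> np.ndarray:
    """Describe each channel of a (C, H, W) image by its DCT coefficients
    at the chosen (u, v) frequency indices. Returns shape (C, len(bands))."""
    spectra = np.stack([dct2(channel) for channel in image])
    return np.stack([spectra[:, u, v] for u, v in bands], axis=1)


def channel_attention(desc: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Hypothetical gate: two-layer MLP plus sigmoid, one weight per channel."""
    hidden = np.maximum(desc @ w1, 0.0)   # ReLU, shape (C, hidden)
    logits = hidden @ w2                  # shape (C, 1)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps weights in [0, 1]


def fuse(image: np.ndarray, bands, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Reweight each channel by a gate computed from its frequency descriptor."""
    gate = channel_attention(band_descriptors(image, bands), w1, w2)
    return image * gate[:, :, None]       # broadcast (C, 1, 1) over (C, H, W)


# Usage with random stand-in data; a trained model would learn w1 and w2.
rng = np.random.default_rng(0)
image = rng.standard_normal((3, 8, 8))    # 3 channels, 8x8 pixels
bands = [(0, 0), (0, 1), (1, 0)]          # illustrative low-frequency picks
w1 = rng.standard_normal((len(bands), 4))
w2 = rng.standard_normal((4, 1))
fused = fuse(image, bands, w1, w2)
print(fused.shape)
```

Building the DCT from an explicit basis matrix keeps the sketch dependency-free beyond NumPy. For real use, a library transform such as `scipy.fft.dctn` with `norm="ortho"` would be the more practical choice.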