OpenSDI: Spotting Diffusion-Generated Images in the Open World

Yabin Wang, Zhiwu Huang, Xiaopeng Hong

TL;DR

This work tackles open-world spotting of diffusion-generated images (OpenSDI) by introducing the OpenSDID benchmark, which captures user diversity, model innovation, and manipulation scope. It proposes the Synergizing Pretrained Models (SPM) scheme and the MaskCLIP model, a CLIP+MAE fusion guided by prompting and three attention blocks (VCA, TVCA, VSA), to achieve robust detection and precise localization without extensive fine-tuning. Extensive experiments show MaskCLIP achieving state-of-the-art performance across in-domain and cross-domain settings, with relative gains over the second-best model of 14.23% in IoU and 14.11% in F1 for localization, and 2.05% in accuracy and 2.38% in F1 for detection. The dataset and code are publicly available, enabling ongoing benchmarking as diffusion models continue to evolve and open-world forgery detection becomes increasingly important.

Abstract

This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at https://github.com/iamwangyabin/OpenSDI.
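The localization gains above are stated as relative improvements in IoU and F1. As a rough illustration of what these quantities measure, the sketch below computes pixel-level IoU and F1 between two binary masks and the relative improvement of one score over another. It is not the authors' evaluation code: the function names, the toy masks, and the convention for empty masks are all assumptions made for this example.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = manipulated pixel, 0 = authentic pixel


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives pixel by pixel."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return tp, fp, fn


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def f1(pred: Mask, target: Mask) -> float:
    """Pixel-level F1: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Toy masks (hypothetical data): 2 pixels overlap, 1 false positive, 1 false negative.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

print(iou(pred, target))                 # 2 / 4
print(f1(pred, target))                  # 4 / 6
print(relative_improvement(0.75, 0.5))   # gain of 0.75 over 0.5, in percent
```

A relative improvement is a ratio of scores, so a 14.23% relative gain in IoU is not the same as 14.23 absolute IoU points.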

Paper Structure

This paper contains 7 sections, 1 equation, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Three open-world settings of the OpenSDI challenge: (1) user diversity by simulating a range of user preferences, (2) model innovation through the use of multiple advanced diffusion models, and (3) a full manipulation scope that enables both global and local generation.
  • Figure 2: OpenSDID examples of authentic and manipulated images with their corresponding ground-truth masks, as well as entirely generated images.
  • Figure 3: MaskCLIP overview. MaskCLIP inherits core components from pretrained CLIP and MAE (the CLIP vision and text encoders and the MAE encoder) and uses an FPN-style decoder for precise pixel-level predictions. To synergize CLIP and MAE, MaskCLIP introduces one prompting block (top right) and three attention blocks: VCA (visual cross-attention), TVCA (textual-visual cross-attention), and VSA (visual self-attention). Frozen components are marked with snowflakes, while tunable components (marked with flames) are optimized using a balanced objective with $\mathcal{L}_{\text{CE}}$ (cross-entropy), $\mathcal{L}_{\text{BCE}}$ (binary cross-entropy), and $\mathcal{L}_{\text{EDG}}$ (edge-weighted loss).
  • Figure 4: Qualitative results on OpenSDID.
  • Figure 5: Robustness evaluation of different SOTA methods under image degradation. It compares performance across varying levels of Gaussian blur (left) and JPEG compression (right) on both in-domain (SD1.5) and cross-domain (SD3) test sets. Results on the remaining test sets are provided in the supplementary material.
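
The Figure 3 caption names a balanced objective built from $\mathcal{L}_{\text{CE}}$, $\mathcal{L}_{\text{BCE}}$, and $\mathcal{L}_{\text{EDG}}$, but this section does not give the term weights or the exact form of the edge-weighted loss. The sketch below shows one plausible reading under stated assumptions: cross-entropy on an image-level prediction, binary cross-entropy on the pixel mask, and the same binary cross-entropy restricted to pixels on the mask boundary. The edge definition (4-neighbourhood label change), the unit weights, and all function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional

EPS = 1e-7  # clamp probabilities to avoid log(0)


def cross_entropy(logits: List[float], label: int) -> float:
    """Softmax cross-entropy for one image-level prediction."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def edge_map(mask: List[List[int]]) -> List[List[int]]:
    """Mark pixels with a differently labelled 4-neighbour (assumed edge definition)."""
    h, w = len(mask), len(mask[0])
    edges = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and mask[ni][nj] != mask[i][j]:
                    edges[i][j] = 1
                    break
    return edges


def weighted_bce(probs: List[List[float]], mask: List[List[int]],
                 weights: Optional[List[List[int]]] = None) -> float:
    """Weighted mean binary cross-entropy over pixels (uniform weights if None)."""
    total, weight_sum = 0.0, 0.0
    for i, row in enumerate(mask):
        for j, t in enumerate(row):
            w = 1.0 if weights is None else float(weights[i][j])
            p = min(max(probs[i][j], EPS), 1.0 - EPS)
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def total_loss(logits: List[float], label: int,
               probs: List[List[float]], mask: List[List[int]],
               w_ce: float = 1.0, w_bce: float = 1.0, w_edg: float = 1.0) -> float:
    """Weighted sum of the three terms; the unit weights are placeholders."""
    l_ce = cross_entropy(logits, label)
    l_bce = weighted_bce(probs, mask)
    l_edg = weighted_bce(probs, mask, edge_map(mask))
    return w_ce * l_ce + w_bce * l_bce + w_edg * l_edg


# Toy inputs (hypothetical): one manipulated pixel in the centre of a 3x3 mask.
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
probs = [[0.5] * 3 for _ in range(3)]
print(edge_map(mask))
print(total_loss([0.0, 0.0], 1, probs, mask))
```

In this reading, the edge term concentrates the pixel loss on the boundary of the manipulated region, where localization errors are most likely; a mask with no boundary contributes zero to that term.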