PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors

Sepehr Dehdashtian, Mashrur M. Morshed, Jacob H. Seidman, Gaurav Bharaj, Vishnu Naresh Boddeti

TL;DR

PolyJuice presents the first black-box, universal unrestricted attack against synthetic image detectors by discovering a realness shift in the text-to-image latent space via SPCA and HSIC. It computes time-varying steering directions that universally steer generated images toward SID failure regions, enabling efficient attacks that transfer across resolutions and attack different detectors, including model-tuned SIDs. The approach achieves substantial attack success (up to 84% improvement) and demonstrates that low-resolution directions can effectively translate to high-resolution images, while also revealing that PolyJuice can enhance SID robustness when used to augment training data (up to 30% FNR reduction). These findings offer a practical, scalable red-teaming tool for evaluating and hardening SIDs, and they motivate defense strategies that account for distribution-based, image-agnostic attacks.

Abstract

Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SIDs' effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).
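
The offline direction-finding step can be illustrated with a small sketch. It assumes that SPCA refers to supervised PCA with an HSIC objective and uses a linear kernel on the binary SID predictions; the paper's exact kernels, latent shapes, and per-timestep handling are not given here. The function name `spca_direction`, its arguments, and the toy Gaussian "latents" are all hypothetical stand-ins. With a linear label kernel the top direction is parallel to the difference between the two class means, which the last lines check.

```python
import numpy as np

def spca_direction(latents, predicted_real):
    """Top supervised-PCA direction for binary labels (linear label kernel).

    latents:        (n, d) array of flattened latents (hypothetical layout).
    predicted_real: (n,) array of 0/1 black-box SID outputs (1 = judged real).
    Returns a unit vector pointing from "predicted fake" to "predicted real".
    """
    X = np.asarray(latents, dtype=float)
    y = np.asarray(predicted_real, dtype=float)
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    L = np.outer(y, y)                       # linear kernel on the labels
    Q = X.T @ H @ L @ H @ X                  # HSIC-style objective matrix, (d, d)
    _, eigvecs = np.linalg.eigh(Q)           # eigenvalues in ascending order
    u = eigvecs[:, -1]                       # eigenvector of the largest one
    # Eigenvectors have arbitrary sign; orient from fake toward real.
    gap = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return -u if u @ gap < 0 else u

# Toy data standing in for T2I latents: "real-looking" samples are shifted.
rng = np.random.default_rng(0)
fake = rng.normal(size=(200, 8))
real = rng.normal(size=(200, 8)) + np.array([3.0, 0, 0, 0, 0, 0, 0, 0])
X = np.vstack([fake, real])
y = np.array([0] * 200 + [1] * 200)

direction = spca_direction(X, y)
gap = real.mean(axis=0) - fake.mean(axis=0)
print("cosine with mean gap:", direction @ gap / np.linalg.norm(gap))
```

Only the SID's 0/1 outputs enter the computation, which is what makes the procedure black-box in this sketch. For realistic latent dimensions the explicit `(d, d)` matrix would be too large, and a low-rank or dual formulation would be needed.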

Paper Structure

This paper contains 35 sections, 8 equations, 22 figures, 11 tables, 3 algorithms.

Figures (22)

  • Figure 1: (a) PolyJuice steers text-to-image (T2I) models to generate images that deceive a synthetic image detection (SID) model. (b) There exists a clearly observable shift between the distribution of the samples predicted as real versus those identified as fake, in the latent space of T2I models.
  • Figure 2: Overview of PolyJuice: At each inference step $t$, PolyJuice manipulates the T2I latent using pre-computed direction $\bm \delta_t$ between predicted real and fake in order to deceive the target SID.
  • Figure 3: Unsteered vs. PolyJuice-steered T2I attacks against UFD. Although UFD detects the unsteered generated images as fake, PolyJuice-disguised samples successfully deceive the detector.
  • Figure 4: Estimated clean images at various timesteps for $\text{FLUX}_{\text{[dev]}}$, where bottom and top rows depict unsteered and PolyJuice-steered generation, respectively.
  • Figure 5: (left) Unsteered and (right) PolyJuice-steered samples projected onto a 2D subspace.
  • ...and 17 more figures
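
The per-step steering described in the Figure 2 caption, together with the resolution transfer mentioned in the abstract, can be sketched as follows. This is a minimal illustration under stated assumptions: the additive update, the `strength` parameter, applying the shift after each denoising step, and the bilinear (align-corners) resize are all assumptions, and `denoise_step`, `steer`, `generate_steered`, and `upsample_direction` are hypothetical names rather than the authors' API.

```python
import numpy as np

def upsample_direction(delta, out_hw):
    """Bilinearly resize a (C, h, w) direction to (C, H, W), align-corners style."""
    _, h, w = delta.shape
    H, W = out_hw
    ys = np.linspace(0, h - 1, H)            # target rows in source coordinates
    xs = np.linspace(0, w - 1, W)            # target cols in source coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]            # vertical blend weights
    wx = (xs - x0)[None, None, :]            # horizontal blend weights
    top = delta[:, y0][:, :, x0] * (1 - wx) + delta[:, y0][:, :, x1] * wx
    bot = delta[:, y1][:, :, x0] * (1 - wx) + delta[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def steer(latent, delta_t, strength):
    """Assumed additive update: shift the latent along the direction for step t."""
    return latent + strength * delta_t

def generate_steered(denoise_step, latent, deltas, strength=1.0):
    """Run a generation loop, steering after every step.

    denoise_step(latent, t) is a hypothetical placeholder for one T2I update;
    deltas holds one pre-computed direction per inference step.
    """
    for t, delta_t in enumerate(deltas):
        latent = denoise_step(latent, t)
        latent = steer(latent, delta_t, strength)
    return latent

# Toy usage: directions estimated on a 4x4 latent grid, applied on an 8x8 grid.
rng = np.random.default_rng(0)
low_res_deltas = [rng.normal(size=(4, 4, 4)) for _ in range(3)]
high_res_deltas = [upsample_direction(d, (8, 8)) for d in low_res_deltas]
identity_step = lambda z, t: z               # stand-in for the real T2I model
out = generate_steered(identity_step, np.zeros((4, 8, 8)), high_res_deltas, 0.5)
print(out.shape)
```

Because the directions are computed once offline and reused for every prompt, the only per-image cost in this sketch is one addition per inference step.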