CAHS-Attack: CLIP-Aware Heuristic Search Attack Method for Stable Diffusion

Shuhan Xia, Jing Dai, Hui Ouyang, Yadong Shang, Dongxiao Zhao, Peipei Li

TL;DR

This work addresses the fragility of diffusion-based text-to-image models to adversarial prompts in a black-box setting. It introduces CAHS-Attack, a CLIP-aware heuristic search method that combines constrained genetic-algorithm root-node selection with Monte Carlo Tree Search to perform efficient suffix-level perturbations in the CLIP embedding space. Empirical results on ImageNet-Short and ImageNet-Long show that CAHS-Attack achieves state-of-the-art attack performance, decreasing text-semantics similarity and degrading CLIP alignment and image quality (e.g., $TS \approx 0.185$ on short prompts and $TS \approx 0.328$ on long prompts; the $FID$ and $CS$ metrics worsen accordingly). The findings reveal a fundamental security risk in CLIP-conditioned pipelines, attributed to the inherent fragility of the text encoder, and highlight the need for defenses against black-box, CLIP-guided prompt perturbations.

Abstract

Diffusion models exhibit notable fragility when faced with adversarial prompts, and strengthening attack capabilities is crucial for uncovering such vulnerabilities and building more robust generative systems. Existing works often rely either on white-box access to model gradients, which is infeasible in real-world deployments because access is restricted, or on hand-crafted prompt engineering, which yields poor attack effect. In this paper, we propose CAHS-Attack, a CLIP-Aware Heuristic Search attack method. CAHS-Attack integrates Monte Carlo Tree Search (MCTS) to perform fine-grained suffix optimization, leveraging a constrained genetic algorithm to preselect high-potential adversarial prompts as root nodes, and retaining the most semantically disruptive outcome at each simulation rollout for efficient local search. Extensive experiments demonstrate that our method achieves state-of-the-art attack performance across both short and long prompts of varying semantics. Furthermore, we find that the fragility of SD models can be attributed to the inherent vulnerability of their CLIP-based text encoders, suggesting a fundamental security risk in current text-to-image pipelines.
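
The root-node preselection described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a CLIP text encoder (hashed character-trigram counts), and the suffix length, population size, mutation rule, and lowercase alphabet are all assumptions chosen for illustration. The only idea taken from the text is that candidate suffixes are mutated under a constraint (here: fixed length, clean prompt left untouched) and ranked by how far they push the prompt embedding from the original.

```python
import hashlib
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed mutation alphabet


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a CLIP text encoder: hashed trigram counts, unit norm."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def text_similarity(clean, adversarial):
    """Cosine similarity between the two unit-norm toy embeddings."""
    return sum(x * y for x, y in zip(toy_embed(clean), toy_embed(adversarial)))


def select_roots(clean, suffix_len=5, pop_size=20, generations=10, n_roots=3, seed=0):
    """Evolve fixed-length suffixes and return the n_roots least similar ones."""
    rng = random.Random(seed)

    def key(suffix):
        # Lower similarity means a more disruptive suffix; the suffix breaks ties.
        return (text_similarity(clean, f"{clean} {suffix}"), suffix)

    initial = ["".join(rng.choice(ALPHABET) for _ in range(suffix_len))
               for _ in range(pop_size)]
    population = sorted(set(initial), key=key)
    for _ in range(generations):
        children = []
        for parent in population:
            # Constrained mutation: replace exactly one character, keep the length.
            i = rng.randrange(suffix_len)
            children.append(parent[:i] + rng.choice(ALPHABET) + parent[i + 1:])
        # Elitist survivor selection over parents and children.
        population = sorted(set(population + children), key=key)[:pop_size]
    return population[:n_roots]


if __name__ == "__main__":
    prompt = "a photo of a dog"
    for suffix in select_roots(prompt):
        print(suffix, round(text_similarity(prompt, f"{prompt} {suffix}"), 3))
```

In a real black-box setting the toy encoder would be replaced by an actual CLIP text encoder, and the returned suffixes would serve as starting points for the tree search rather than as final adversarial prompts.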

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Illustrating the Fragility of Stable Diffusion under Adversarial Prompt Perturbations. The attacked characters are highlighted in red.
  • Figure 2: Overview of the six steps in CAHS-Attack. We begin by applying a constrained mutation-based strategy (left) to generate semantically plausible yet adversarially potent candidates as root nodes for MCTS. The subsequent MCTS process (middle) iteratively expands suffixes through selection, expansion, evaluation, simulation, and backpropagation, guided entirely by cosine similarity in the CLIP embedding space. The final adversarial prompt is then used to query SD (right).
  • Figure 3: Visual comparison of image outputs under different prompt attack methods on short prompts. Our method generates semantically divergent images.
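
The MCTS stage summarized in the Figure 2 caption (selection, expansion, evaluation, simulation, backpropagation) might be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: `toy_similarity` is a hypothetical bigram-count cosine standing in for CLIP cosine similarity, and the UCB1 selection rule, uniform random rollout policy, character-level actions, and all default parameters are assumptions. The one detail taken from the abstract is that each rollout retains its most semantically disruptive outcome.

```python
import math
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed action space: one character per step


def toy_similarity(a, b):
    """Hypothetical stand-in for CLIP cosine similarity (character-bigram counts)."""
    ca, cb = (Counter(t[i:i + 2] for i in range(len(t) - 1)) for t in (a, b))
    dot = sum(ca[k] * cb[k] for k in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def disruption(clean, suffix):
    """Reward: one minus similarity between the clean and the suffixed prompt."""
    return max(0.0, 1.0 - toy_similarity(clean, f"{clean} {suffix}"))


class Node:
    def __init__(self, suffix, parent=None):
        self.suffix, self.parent = suffix, parent
        self.children = {}
        self.visits = 0
        self.total_reward = 0.0


def mcts_attack(clean, root_suffix="", max_len=6, iterations=200, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(root_suffix)
    best_suffix, best_reward = root_suffix, disruption(clean, root_suffix)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes using UCB1.
        while len(node.suffix) < max_len and len(node.children) == len(ALPHABET):
            log_n = math.log(node.visits)
            node = max(node.children.values(),
                       key=lambda ch: ch.total_reward / ch.visits
                       + c * math.sqrt(log_n / ch.visits))
        # Expansion: add one untried character to the suffix.
        if len(node.suffix) < max_len:
            ch = rng.choice([a for a in ALPHABET if a not in node.children])
            node.children[ch] = Node(node.suffix + ch, parent=node)
            node = node.children[ch]
        # Evaluation: score the new suffix directly.
        reward, reward_suffix = disruption(clean, node.suffix), node.suffix
        # Simulation: random rollout, retaining the most disruptive prefix seen.
        rollout = node.suffix
        while len(rollout) < max_len:
            rollout += rng.choice(ALPHABET)
            r = disruption(clean, rollout)
            if r > reward:
                reward, reward_suffix = r, rollout
        if reward > best_reward:
            best_suffix, best_reward = reward_suffix, reward
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return f"{clean} {best_suffix}", best_reward


if __name__ == "__main__":
    print(mcts_attack("a photo of a dog"))
```

Because the search only ever needs similarity scores from the text encoder, no gradients of the diffusion model are required, which is what makes the setting black-box; the image generator is queried only once, with the final prompt.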