HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models
Sensen Gao, Xiaojun Jia, Yihao Huang, Ranjie Duan, Jindong Gu, Yang Bai, Yang Liu, Qing Guo
TL;DR
This work tackles NSFW jailbreaks in text-to-image models by proposing HTS-Attack, a fully black-box, query-based method that operates in two stages: Sensitive Token Removal Initialization to reduce prompt sensitivity, and Heuristic Token Search to iteratively optimize adversarial prompts. By leveraging a CLIP-based surrogate and a population-inspired search with recombination and mutation, HTS-Attack bypasses prompt checkers, post-hoc image checkers, securely trained models, and online systems, while preserving the semantic target of the NSFW prompt. Empirical results across multiple defenses and model classes show HTS-Attack achieving high bypass rates and strong semantic fidelity (BLIP similarity), often outperforming gradient-based and RL-based baselines. The findings underscore the need for more robust defenses that address discrete-token spaces and dynamic defense adaptations in T2I systems, with practical implications for policy and safety mechanisms in image generation.
Abstract
Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet they still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods that use gradient masking are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited jailbreaking performance on T2I models because of the difficulty of optimizing in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search in which high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
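The two-stage structure described in the abstract (an initialization that filters tokens, then a search that recombines and mutates high-scoring candidates while retaining suboptimal ones) can be illustrated with a generic population-style search over token lists. The following is a minimal sketch under stated assumptions, not the authors' implementation: every function name, the single-point crossover, the per-token mutation rate, and the toy scoring function over harmless words are hypothetical. The actual scoring signal (a CLIP-based surrogate plus queries to the target system, per the TL;DR) is replaced here by a simple position-match count.

```python
import random

def remove_flagged(tokens, flagged):
    """Initialization step (assumed form): drop tokens found in a flagged list."""
    return [t for t in tokens if t not in flagged]

def recombine(a, b, rng):
    """Single-point crossover of two token lists (an assumed recombination rule)."""
    limit = min(len(a), len(b))
    if limit < 2:
        return list(a)
    cut = rng.randrange(1, limit)
    return a[:cut] + b[cut:]

def mutate(tokens, vocab, rng, rate=0.2):
    """Replace each token with a random vocabulary token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]

def heuristic_search(seed, vocab, score, rng, pool_size=20, keep=5, steps=30):
    """Population-style search; `score` is any callable mapping tokens to a number."""
    best, best_score = list(seed), score(seed)
    parents = [best]
    for _ in range(steps):
        # Build a new candidate pool by recombining and mutating current parents.
        pool = [
            mutate(recombine(rng.choice(parents), rng.choice(parents), rng), vocab, rng)
            for _ in range(pool_size)
        ]
        pool.sort(key=score, reverse=True)
        # Keep the best-so-far plus several lower-ranked candidates as parents,
        # so the search is not driven by a single candidate.
        parents = pool[:keep] + [best]
        if score(pool[0]) > best_score:
            best, best_score = pool[0], score(pool[0])
    return best, best_score

# Toy usage with a harmless vocabulary and a stand-in objective.
rng = random.Random(0)
vocab = ["red", "blue", "cat", "dog", "tree", "river"]
target = ["red", "cat", "tree", "river"]

def toy_score(tokens):
    # Stand-in objective: number of positions that match the target list.
    return sum(t == g for t, g in zip(tokens, target))

seed = remove_flagged(["red", "FLAGGED", "dog", "dog", "dog"], {"FLAGGED"})
best, best_score = heuristic_search(seed, vocab, toy_score, rng)
```

Because the best-so-far candidate is replaced only on strict improvement, the returned score can never be lower than the seed's score. Keeping several lower-ranked candidates in the parent set is the part of this sketch that mirrors the abstract's point about using both optimal and suboptimal candidates to avoid local optima.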
