Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

Yichuan Cao, Yibo Miao, Xiao-Shan Gao, Yinpeng Dong

TL;DR

This work addresses safety vulnerabilities in text-to-image systems by introducing RPG-RT, a rule-based preference modeling guided red-teaming framework that operates in realistic commercial black-box settings. It uses an LLM to iteratively modify prompts, a detector and a scoring model to provide fine-grained feedback, and Direct Preference Optimization with LoRA to refine the LLM based on binary and scalar preferences. The approach demonstrates superior attack success rates across 19 T2I models and multiple online APIs while preserving semantic content, and it generalizes to unseen prompts and to text-to-video scenarios. Practically, RPG-RT offers a robust, scalable method for evaluating and stress-testing safety defenses in real-world deployed T2I systems, highlighting areas for strengthening filters and alignment in commercial APIs.

Abstract

Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models' security through red-teaming is vital, yet white-box approaches require internal access, which makes them hard to apply to closed-source models. Moreover, existing black-box methods often assume knowledge of the model's specific defense mechanisms, limiting their utility in real-world commercial API scenarios. A significant challenge is how to evade unknown and diverse defense mechanisms. To overcome this difficulty, we propose a novel Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively uses an LLM to modify the prompts sent to the T2I system and leverages the system's feedback to fine-tune the LLM. RPG-RT treats the feedback from each iteration as a prior, enabling the LLM to dynamically adapt to unknown defense mechanisms. Because the feedback is often labeled and coarse-grained, making it difficult to use directly, we further propose rule-based preference modeling, which employs a set of rules to evaluate desired or undesired feedback, enabling finer-grained control over the LLM's dynamic adaptation process. Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach.
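
The idea of turning coarse, labeled feedback into preference data can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Feedback` record, the three outcome labels, the `OUTCOME_RANK` ordering, the scalar `score`, and the `margin` threshold are all hypothetical. The sketch only shows how a coarse binary ordering and a finer scalar rule might be combined into (chosen, rejected) pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Feedback:
    prompt: str    # modified prompt that was sent to the black-box system
    outcome: str   # hypothetical coarse label returned by a detector
    score: float   # hypothetical fine-grained rule-based score in [0, 1]


# Assumed coarse ordering: a higher rank means "more desired" by the evaluator.
OUTCOME_RANK = {"rejected": 0, "safe_image": 0, "unsafe_image": 1}


def build_preference_pairs(feedback, margin=0.1):
    """Return (chosen_prompt, rejected_prompt) pairs.

    Rule 1 (binary): a higher outcome rank is always preferred.
    Rule 2 (scalar): within the same rank, prefer the higher score,
    but only when the gap is at least `margin`.
    """
    pairs = []
    for a, b in combinations(feedback, 2):
        rank_a, rank_b = OUTCOME_RANK[a.outcome], OUTCOME_RANK[b.outcome]
        if rank_a != rank_b:
            chosen, rejected = (a, b) if rank_a > rank_b else (b, a)
        elif abs(a.score - b.score) >= margin:
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
        else:
            continue  # too similar to yield a reliable preference
        pairs.append((chosen.prompt, rejected.prompt))
    return pairs


feedback = [
    Feedback("p1", "rejected", 0.0),
    Feedback("p2", "unsafe_image", 0.8),
    Feedback("p3", "unsafe_image", 0.5),
    Feedback("p4", "safe_image", 0.05),
]
pairs = build_preference_pairs(feedback)
```

In this toy input, `p1` and `p4` share a rank and differ by less than the margin, so no pair is formed between them; every other combination yields one pair.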

Paper Structure

This paper contains 36 sections, 7 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Overview of our RPG-RT framework. a) Stage 1: The LLM generates multiple different modifications of the prompt, then inputs them into the target T2I black-box system and obtains the outputs. b) Stage 2: A binary partial order is constructed to model the preferences of the T2I system. Rule-based scoring is utilized to enable fine-grained control over the LLM's exploration of the commercial black-box system. c) Stage 3: The LLM agent is fine-tuned using DPO based on the generative preferences of the target T2I system.
  • Figure 2: Overview of our scoring model. a): Motivation: the presence of harmful or semantically identical non-harmful semantics can lead to a high CLIP similarity between two images, causing confusion that cannot be resolved by a straightforward CLIP similarity measure. b): Our key insight is to decouple the CLIP representation using a transformation $f=(f_n,f_s)$, where $f_n$ captures harmful content, and $f_s$ captures other innocuous semantics, allowing separation of the representation and a clearer distinction from confusion. c): To train our scoring model, we design four loss functions tailored to address the intensity of harmful semantics, the invariance of benign semantics, the similarity between benign semantics, and the reconstructability of information.
  • Figure 3: Qualitative visualization results of baselines and our RPG-RT. Our RPG-RT can a): effectively bypass the safety checker and generate images across various NSFW categories, b): generate pornographic images on multiple APIs, and c): generalize to text-to-video systems.
  • Figure 4: Full qualitative visualization results of baselines and our RPG-RT in generating images with nudity semantics on nineteen T2I systems equipped with various defense mechanisms.
  • Figure 5: Full qualitative visualization results of baselines and our RPG-RT across various NSFW categories.
  • ...and 2 more figures
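
Stage 3 in Figure 1 relies on Direct Preference Optimization (DPO). As a rough illustration, the standard DPO loss for a single (chosen, rejected) pair can be computed as below. This is the generic DPO objective, not code from the paper; the function name, argument names, and the default `beta` are assumptions, and the inputs are taken to be summed log-probabilities of each response under the trainable policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference does, which is the signal that drives the fine-tuning step.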