Table of Contents
Fetching ...

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, Tianwei Zhang

TL;DR

This work proposes a novel Automatic Red-Teaming framework, ART, which leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities.

Abstract

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found in https://github.com/GuanlinLee/ART.

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users

TL;DR

This work proposes a novel Automatic Red-Teaming framework, ART, which leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities.

Abstract

Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found in https://github.com/GuanlinLee/ART.
Paper Structure (31 sections, 13 figures, 14 tables)

This paper contains 31 sections, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Safe prompts can lead to harmful and illegal images. Prompts are shown below the images.
  • Figure 2: Pipeline of ART after initialization round.
  • Figure 3: Ratio of safe prompt and success ratio for unsafe images under different Stable Diffusion generation settings. Results for other categories can be found in Appendix \ref{['app:full']}.
  • Figure 4: Diversity of generated prompts for categories. Dash lines stand for the results of Curiosity.
  • Figure 5: Impacts of guidance scales and image resolutions in the red-teaming process.
  • ...and 8 more figures