Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Jonathan Nöther, Adish Singla, Goran Radanović

TL;DR

This work addresses targeted safety assessment of large language models by constraining red-teaming prompts to stay close to prompts from a reference dataset. It proposes DART, a diffusion-based black-box method that perturbs reference prompts in embedding space within a budget; the attack model is trained with PPO and a proximity regularizer to maximize harmful outputs from a target model. Across multiple target models and datasets, DART outperforms RL and prompting baselines at finding prompts near the references that elicit toxic responses, which makes it possible to identify the topics and styles where defenses succeed or fail. The approach offers a practical tool for focused safety audits that can inform targeted improvements in alignment and guardrails, and it suggests extensions to multi-turn conversations and automatic budget selection.

Abstract

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.
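The idea of perturbing a reference prompt in embedding space while controlling the amount of change can be illustrated with a minimal sketch. It assumes the budget bounds the L2 norm of the noise vector and that over-budget noise is rescaled onto the budget boundary; neither detail is stated in the abstract, and the function names `clip_to_budget` and `perturb` are hypothetical. Decoding the perturbed embedding back into text is not shown.

```python
import math


def clip_to_budget(noise, epsilon):
    """Rescale `noise` so that its L2 norm is at most `epsilon`.

    Assumption: the proximity budget is an L2 bound on the noise vector.
    """
    norm = math.sqrt(sum(x * x for x in noise))
    if norm <= epsilon:
        # Already within budget: return an unchanged copy.
        return list(noise)
    scale = epsilon / norm
    return [x * scale for x in noise]


def perturb(embedding, noise, epsilon):
    """Add budget-limited noise to a reference prompt embedding."""
    clipped = clip_to_budget(noise, epsilon)
    return [e + n for e, n in zip(embedding, clipped)]
```

With this construction, the perturbed embedding never lies farther than `epsilon` from the reference embedding, whatever noise the attack model proposes.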

Paper Structure (39 sections, 5 equations, 4 figures, 6 tables, 2 algorithms)

Figures (4)

  • Figure 1: Illustration of our approach. We are given an initial prompt that results in a harmless answer from the target model. Our goal is to train the attack model to modify the prompt so that the original intent is preserved but the target model responds in a harmful way.
  • Figure 2: Red dots correspond to prompts that result in harmful responses, while blue dots represent prompts that result in harmless responses. For each prompt, we aim to learn the noise vector $n_a$ that results in harmful behavior but does not exceed the budget $\epsilon$.
  • Figure 3: Results of the evaluation of DART and the proposed baselines. Attack Success Rate (ASR) corresponds to the fraction of prompts that result in a response that is classified as toxic with probability $>50\%$. Cosine similarity measures the similarity of unmodified and modified prompts. For both metrics, higher is better. DART generally outperforms the proposed baselines when compared with methods that achieve similar cosine similarity. Unmodified is omitted for the alpaca dataset and GPT2-alpaca and Vicuna, as the ASR is $0$.
  • Figure 4: Safety evaluation of Vicuna-7b. The subcategories are divided into four color-coded areas of harmful behavior: red corresponds to topics related to violence, green to controversial and adult topics, blue to illegal and dangerous instructions, and violet to privacy. The bars indicate the success rate of the prompts on the given topic when modified with DART. The gray dotted line marks the average success rate. As the rate of harmful responses shows, the model's safety mechanisms are less robust in the area of "Controversial and Adult Topics", while they are very robust with regard to "Self-Harm", "Terrorism and Organized Crime", and "Privacy".
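The two metrics named in the Figure 3 caption can be written down as a short sketch. It assumes that a toxicity classifier has already produced one probability per response and that prompts are compared through embedding vectors; the caption does not specify how the similarity is computed, and the function names are hypothetical.

```python
import math


def attack_success_rate(toxicity_probs, threshold=0.5):
    """Fraction of responses classified as toxic with probability > threshold."""
    if not toxicity_probs:
        raise ValueError("need at least one response")
    hits = sum(1 for p in toxicity_probs if p > threshold)
    return hits / len(toxicity_probs)


def cosine_similarity(u, v):
    """Cosine of the angle between two prompt embeddings of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Because the comparison is strict, a response scored at exactly 0.5 does not count as a success, which matches the "$>50\%$" wording of the caption.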