FLRT: Fluent Student-Teacher Redteaming

T. Ben Thompson, Michael Sklar

TL;DR

The paper addresses the vulnerability of safety-tuned, instruction-following language models to adversarial prompts by proposing a fluent, token-level red-teaming framework. Its core is a distillation-based objective that pushes the victim model to emulate a toxified finetune of itself, matching either output probabilities or internal activations. Multi-model perplexity and repetition penalties are added to the objective so that the optimized prompts read like human-written text. The optimizer extends GCG and BEAST with token insert, delete, and swap operations and allows the prompt length to vary. The method reaches high attack success rates ($>93\%$ on Llama-2-7B, Llama-3-8B, and Vicuna-7B) while keeping perplexity low, and it also yields a single universal fluent prompt that transfers to unseen tasks and to other black-box models. The work offers practical red-teaming guidance and illustrates the ongoing arms race between fluent jailbreaks and safety defenses, highlighting both the efficacy and the cost of such attacks and suggesting directions for model hardening.
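As a rough illustration of the insert/delete/swap operations mentioned above, the sketch below proposes a single random edit to a token sequence. It is not the paper's optimizer: the function name `propose_edit`, the uniform choice of operation and position, and the reading of "swap" as replacing one token with another vocabulary token are all assumptions made for this example. Candidate scoring is omitted entirely.

```python
import random


def propose_edit(tokens, vocab, rng):
    """Return (op, new_tokens) after one random insert, delete, or swap.

    Hypothetical helper for illustration only. "swap" is assumed to mean
    replacing the token at one position with another vocabulary token.
    """
    ops = ["insert"]
    if tokens:
        ops.append("swap")
    if len(tokens) > 1:
        # Never delete the last remaining token.
        ops.append("delete")
    op = rng.choice(ops)

    new_tokens = list(tokens)  # leave the caller's sequence untouched
    if op == "insert":
        pos = rng.randrange(len(new_tokens) + 1)
        new_tokens.insert(pos, rng.choice(vocab))
    elif op == "delete":
        pos = rng.randrange(len(new_tokens))
        del new_tokens[pos]
    else:  # swap
        pos = rng.randrange(len(new_tokens))
        new_tokens[pos] = rng.choice(vocab)
    return op, new_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    op, candidate = propose_edit([1, 2, 3], vocab=[10, 11, 12], rng=rng)
    print(op, candidate)
```

Because inserts and deletes change the sequence length by one, repeated proposals let the prompt length vary during optimization, which is the property the summary highlights.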

Abstract

Many publicly available language models have been safety tuned to reduce the likelihood of toxic or liability-inducing text. To redteam or jailbreak these models for compliance with toxic requests, users and security analysts have developed adversarial prompting techniques. One attack method is to apply discrete optimization techniques to the prompt. However, the resulting attack strings are often gibberish text, easily filtered by defenders due to high measured perplexity, and may fail for unseen tasks and/or well-tuned models. In this work, we improve existing algorithms (primarily GCG and BEAST) to develop powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our technique centers around a new distillation-based approach that encourages the victim model to emulate a toxified finetune, either in terms of output probabilities or internal activations. To encourage human-fluent attacks, we add a multi-model perplexity penalty and a repetition penalty to the objective. We also enhance optimizer strength by allowing token insertions, token swaps, and token deletions and by using longer attack sequences. The resulting process is able to reliably jailbreak the most difficult target models with prompts that appear similar to human-written prompts. On Advbench we achieve attack success rates $>93$% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while maintaining model-measured perplexity $<33$; we achieve $95$% attack success for Phi-3, though with higher perplexity. We also find a universally-optimized single fluent prompt that induces $>88$% compliance on previously unseen tasks across Llama-2-7B, Phi-3-mini and Vicuna-7B and transfers to other black-box models.
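To make the fluency terms concrete, here is a minimal sketch of a perplexity penalty averaged over several models plus a repetition penalty. The abstract does not give the exact form of either term, so every definition below is an assumption: the function names, the use of mean log-perplexity across models, the duplicate-token fraction as the repetition measure, and the weights `w_ppl` and `w_rep`. The per-token log-probabilities are assumed to come from whichever language models are used to score the prompt.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """Perplexity from natural-log per-token probabilities under one model."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def multi_model_log_perplexity(per_model_logprobs):
    """Mean log-perplexity across models (assumed aggregation rule)."""
    log_ppls = [-sum(lp) / len(lp) for lp in per_model_logprobs]
    return sum(log_ppls) / len(log_ppls)


def repetition_penalty(tokens):
    """Fraction of tokens that repeat an earlier token (assumed definition)."""
    counts = Counter(tokens)
    return sum(c - 1 for c in counts.values()) / len(tokens)


def fluency_penalty(tokens, per_model_logprobs, w_ppl=1.0, w_rep=1.0):
    """Weighted sum of both penalties; the weights are hypothetical."""
    return (w_ppl * multi_model_log_perplexity(per_model_logprobs)
            + w_rep * repetition_penalty(tokens))


if __name__ == "__main__":
    tokens = ["the", "cat", "the", "mat"]
    # Toy log-probabilities from two hypothetical scoring models.
    logprobs_a = [math.log(0.5)] * 4
    logprobs_b = [math.log(0.25)] * 4
    print(perplexity(logprobs_a))
    print(fluency_penalty(tokens, [logprobs_a, logprobs_b]))
```

Scoring against several models rather than one makes it harder for an optimizer to exploit the quirks of a single model's notion of fluency, and the repetition term discourages low-perplexity strings that simply repeat the same tokens.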

Paper Structure

This paper contains 27 sections, 12 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Here we break down the parts of a typical adversarial attack optimization, following the framing of Zou et al. (2023). The gray tokens form the chat template, the blue tokens are the desired task, the red tokens are the optimized attack itself, and the purple tokens are the model's generation. The full user prompt lies between the <|user|> token and the <|end|> token. The example here uses the Phi-3 tokenizer and chat template; other models use an equivalent prompt structure.
  • Figure 2: Increasing prompt length improves the objective of combined attack effectiveness and fluency. Displayed points are averaged over five independent optimizations at each length.