
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming

Jiaxu Liu, Xiangyu Yin, Sihao Wu, Jianhong Wang, Meng Fang, Xinping Yi, Xiaowei Huang

TL;DR

The paper tackles the safety and robustness gaps in LLM red-teaming by introducing a prefix-based sentinel that detoxifies prompts with only a small token addition, without modifying the target model. It frames the defense as a two-agent Stackelberg game trained with PPO, featuring a MAPPO-inspired, partially shared value head to stabilize learning across agents. Across text-to-text and text-to-image tasks, and against models such as Llama-2 and GPT-3.5, the sentinel consistently reduces toxic outputs while preserving usefulness for smaller targets, and remains scalable to larger models with some trade-offs in factuality for the largest targets. These findings suggest a practical, model-agnostic pathway to safer LLM deployment, though further work is needed to maintain quality on very large models and to broaden safety criteria beyond toxicity.

Abstract

With the proliferation of red-teaming strategies for Large Language Models (LLMs), the lack of literature on improving the safety and robustness of LLM defense strategies is becoming increasingly pronounced. This paper introduces the LLM-based sentinel model as a plug-and-play prefix module designed to reconstruct the input prompt with just a few ($<30$) additional tokens, effectively reducing toxicity in responses from target LLMs. The sentinel model naturally overcomes the parameter inefficiency and limited model accessibility that hinder fine-tuning of large target models. We employ an interleaved training regimen using Proximal Policy Optimization (PPO) to optimize both red team and sentinel models dynamically, incorporating a value head-sharing mechanism inspired by the multi-agent centralized critic to manage the complex interplay between agents. Our extensive experiments across text-to-text and text-to-image tasks demonstrate the effectiveness of our approach in mitigating toxic outputs, even when dealing with larger models like Llama-2, GPT-3.5 and Stable-Diffusion, highlighting the potential of our framework in enhancing safety and robustness in various applications.
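The plug-and-play flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `defend`, `toy_sentinel`, and `toy_target` are hypothetical, both models are stand-in callables over token lists, and the sketch assumes the sentinel appends its tokens after the incoming prompt before the frozen target sees it. Only the token budget (fewer than 30 added tokens) comes from the abstract.

```python
from typing import Callable, List, Tuple

Tokens = List[str]

# From the abstract: the sentinel adds fewer than 30 tokens to the prompt.
MAX_ADDED_TOKENS = 29


def defend(
    prompt: Tokens,
    sentinel: Callable[[Tokens], Tokens],
    target: Callable[[Tokens], Tokens],
    max_added: int = MAX_ADDED_TOKENS,
) -> Tuple[Tokens, Tokens]:
    """Refine a prompt with a short sentinel continuation, then query the target.

    The target is only called, never modified, which is what makes the
    sentinel usable with models that cannot be fine-tuned.
    """
    added = sentinel(prompt)[:max_added]  # enforce the token budget
    refined = prompt + added              # assumption: tokens are appended
    return refined, target(refined)


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_sentinel(prompt: Tokens) -> Tokens:
    return ["Please", "respond", "safely", "."]


def toy_target(prompt: Tokens) -> Tokens:
    return [f"<response to {len(prompt)} tokens>"]


refined_prompt, response = defend(["tell", "me", "something"], toy_sentinel, toy_target)
```

In a real setup the two callables would wrap a small tuned language model and an inference-only target. The training side (interleaved PPO with a shared value head) is not shown here.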

Paper Structure (36 sections, 24 equations, 10 figures, 9 tables)

This paper contains 36 sections, 24 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Red-teaming and sentinel defense procedure. $\pi_r$ generates tokens based on the corpus $z$, producing a prompt $x$ that elicits toxic target responses. $\pi_s$ continues from $x$ and elicits safe target responses.
  • Figure 2: Schematic of our framework. Frozen (inference-only) modules are marked in the figure. (a) Optimizing the red team model to generate toxic prompts. (b) Optimizing the sentinel model to defend against red-teaming. The KL module aligns $\pi$ with the reference $\pi^\mathrm{ref}$, constraining $\pi$ so that it does not output gibberish.
  • Figure 3: Different value head strategies under the multi-LLM-agent scenario.
  • Figure 4: The rewards $R(y)$, $R(y^\star)$ and metric $R_\mathrm{mae}^{0.5}$ for the Text Continuation task. Training follows the framework in Fig. \ref{figure:pipeline}. GPT2 (red team) and GPT2 (sentinel) are employed, with the sentinel model left untrained. The target model is GPT2-imdb.
  • Figure 5: Our framework enables both the red team and sentinel models to excel in attack and defense against target LLMs in Text Continuation (TC) and Instruction Following (IF). The curves represent the mean values of various reward metrics throughout the PPO optimization epochs, and the shaded area shows the standard deviation. See Sec. \ref{sec:curiosity-baselines-cmp} for details.
  • ...and 5 more figures