Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim

TL;DR

Active Attacks reframes red-teaming as an adaptive RL problem where the victim LLM is periodically safety-fine-tuned to dampen exploited prompts, creating an easy-to-hard exploration curriculum for the attacker. By integrating this adaptive environment with off-policy diversity objectives (e.g., GFlowNet-based sampling), the method achieves broad multimodal coverage of attack prompts and significantly improves cross-attack defense performance relative to prior baselines. The approach demonstrates near-100% defense rates across multiple victim LLMs, transfers to unseen larger models, and maintains instruction-following capabilities after safety fine-tuning. Practically, it provides a simple, plug-and-play module that can enhance safety datasets for fine-tuning and improve robustness against evolving attack strategies with modest additional computation.

Abstract

We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods (including GFlowNets, PPO, and REINFORCE), improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation. Our code is publicly available at https://github.com/dbsxodud-11/active_attacks.
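The mode-collapse argument in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation. It assumes each attack mode has one scalar reward standing in for the toxicity classifier's score, replaces attacker training with an exact softmax policy-gradient update on a bandit (rather than GFlowNets, PPO, or REINFORCE on an LLM), and models safety fine-tuning as setting the reward of the exploited mode to zero. The mode name "violence", the reward values, and the function names are all hypothetical.

```python
import math

def train_attacker(rewards, steps=200, lr=1.0):
    """Toy stand-in for RL training of a freshly initialized attacker.

    Runs expected softmax policy-gradient updates on a bandit whose arms are
    attack modes, then returns the mode the policy has collapsed onto.
    """
    prefs = {m: 0.0 for m in rewards}  # uniform policy at (re)initialization
    for _ in range(steps):
        z = sum(math.exp(p) for p in prefs.values())
        probs = {m: math.exp(p) / z for m, p in prefs.items()}
        baseline = sum(probs[m] * rewards[m] for m in rewards)
        for m in rewards:
            # Exact gradient of expected reward w.r.t. the softmax preference.
            prefs[m] += lr * probs[m] * (rewards[m] - baseline)
    return max(prefs, key=prefs.get)

def run_rounds(initial_rewards, rounds, adaptive):
    """Train one attacker per round; optionally adapt the victim in between."""
    rewards = dict(initial_rewards)  # copy so the caller's dict is untouched
    discovered = []
    for _ in range(rounds):
        mode = train_attacker(rewards)  # attacker is reinitialized each round
        discovered.append(mode)
        if adaptive:
            # Assumed model of safety fine-tuning: the exploited mode no
            # longer yields any reward from the victim.
            rewards[mode] = 0.0
    return discovered

# Hypothetical per-mode rewards: higher means easier to exploit.
VULNERABILITY = {"insult": 0.9, "sexual": 0.6, "violence": 0.3}

fixed = run_rounds(VULNERABILITY, rounds=3, adaptive=False)
active = run_rounds(VULNERABILITY, rounds=3, adaptive=True)
print("fixed environment:   ", fixed)
print("adaptive environment:", active)
```

In this toy setting the fixed environment returns the easiest mode in every round, while the adaptive environment visits the modes from easiest to hardest, which mirrors the easy-to-hard curriculum described above. It says nothing about the magnitude of the gains reported in the paper.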

Paper Structure

This paper contains 33 sections, 4 equations, 8 figures, 18 tables, 1 algorithm.

Figures (8)

  • Figure 1: Red-teaming LLMs via adaptive environments. Prior works train an agent (attacker LLM) with a fixed environment (victim LLM and toxicity classifier). In Active Attacks, we periodically safety fine-tune the victim LLM to make the environment adaptive and reinitialize the attacker LLM and replay buffer. This procedure flattens the already explored region and naturally induces an easy-to-hard exploration curriculum.
  • Figure 2: Toxicity-diversity trade-off of different red-teaming approaches. Active Attacks successfully generates diverse prompts in terms of categorical distance. Results for other victim LLMs are in the appendix (app:main-tradeoff).
  • Figure 3: Cross-model attack success rate between GFlowNet and GFlowNet + Active Attacks. Experimental results for other red-teaming approaches are presented in the appendix (app:main-cross).
  • Figure 4: Cross-attack success rate between other RL-based approaches and their Active Attacks counterparts.
  • Figure 5: Quality-diversity curve across multiple rounds of adapting environments.
  • ...and 3 more figures