Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning

Si Chen, Xiao Yu, Ninareh Mehrabi, Rahul Gupta, Zhou Yu, Ruoxi Jia

TL;DR

This work introduces GALA, a dual-learning, memory-enabled agent for multi-turn red-teaming of LLMs. By combining global tactic-wise learning with local prompt-wise learning and maintaining an explicit belief state, GALA discovers new attack tactics, refines prompt implementations, and plans context-aware strategies across turns. Empirical results on JailbreakBench show GALA achieving state-of-the-art attack success rates and higher prompt diversity across multiple target models, including GPT-3.5-Turbo and Llama-3.1 variants, with notable gains from dual learning and robustness under weaker attacker models. The approach offers a more comprehensive and scalable framework for automated security evaluation and vulnerability discovery in realistic multi-turn adversarial settings.

Abstract

The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread. While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on threat model responses. In this paper, we propose GALA, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions: global tactic-wise learning that accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning that refines implementations for specific goals when initial attempts fail. Unlike previous multi-turn approaches that rely on fixed strategy sets, GALA enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns, outperforming state-of-the-art baselines. These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.
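
The control flow described above (plan each turn, update a belief state, then learn globally on success or locally on failure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Memory`, `run_trial`, and the `plan`, `target`, and `judge` callables are hypothetical names introduced here, the belief state is reduced to a plain dictionary, and both learning steps are reduced to appending a note to memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    # Global memory: tactic-level notes reused across goals (assumed structure).
    tactic_notes: List[str] = field(default_factory=list)
    # Local memory: per-goal notes on refining a failed attempt (assumed structure).
    prompt_notes: Dict[str, List[str]] = field(default_factory=dict)


def run_trial(
    goal: str,
    memory: Memory,
    plan: Callable[[dict, Memory], str],   # hypothetical planner: belief + memory -> prompt
    target: Callable[[str], str],          # hypothetical target model: prompt -> response
    judge: Callable[[str, str], bool],     # hypothetical judge: (goal, response) -> success?
    max_turns: int = 5,                    # the abstract reports results within 5 turns
) -> bool:
    """Run one multi-turn trial and apply the matching learning step."""
    belief = {"goal": goal, "turn": 0, "history": []}
    success = False

    for turn in range(1, max_turns + 1):
        prompt = plan(belief, memory)       # adaptive planning for this turn
        response = target(prompt)
        belief["turn"] = turn               # belief-state update
        belief["history"].append((prompt, response))
        if judge(goal, response):
            success = True
            break

    if success:
        # Global tactic-wise learning: keep knowledge that may transfer to new goals.
        memory.tactic_notes.append(f"worked for {goal!r} in {belief['turn']} turn(s)")
    else:
        # Local prompt-wise learning: keep goal-specific notes for the next attempt.
        memory.prompt_notes.setdefault(goal, []).append(
            f"no success within {max_turns} turns"
        )
    return success
```

Passing the planner, target, and judge in as callables keeps the sketch independent of any particular model API; in a real system each of them would be backed by an LLM call, and the two learning steps would do far more than record a note.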

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of GALA's workflow. In each interaction round, GALA ① performs adaptive planning and ② updates its belief state. Upon completion of a trial, it ③ conducts global tactic-wise learning if the trial succeeds, or local prompt-wise learning otherwise.
  • Figure 2: GALA outperforms the strongest baseline, GOAT, on most categories, demonstrating its effectiveness across diverse attack scenarios.
  • Figure 3: The initial knowledge base used by GALA and GOAT. It includes comprehensive definitions of each tactic and selected illustrative examples from lillm to guide the automated red-teaming process.
  • Figure 4: The prompt used by our attacker model to make an attack plan at the initial interaction round for a given malicious goal.
  • Figure 5: The prompt used by our attacker model to make an attack plan at an intermediate interaction round for a given malicious goal.
  • ...and 3 more figures