Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori

TL;DR

The paper addresses a gap in LLM red teaming by focusing on multi-turn, conversation-based adversarial testing rather than single-turn prompts. It introduces GOAT, an automated agent that reasons through a suite of prompting techniques within extended dialogues to probe the safety boundaries of target LLMs. GOAT combines an attacker LLM, a structured reasoning protocol, and a modular attack toolbox to run multi-turn jailbreak attempts at scale, reaching an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4-Turbo on the JailbreakBench dataset within a five-turn budget. GOAT also outperforms the prior multi-turn approach Crescendo under equivalent conditions, which highlights the value of conversational context and iterative strategy selection when red teaming in real-world settings.

Abstract

Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model's response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.
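
The abstract reports results as ASR@10 without defining the metric. A minimal sketch of one common reading follows, under the assumption that an objective counts as broken if at least one of its first k attack conversations is judged successful. The function name `asr_at_k` and the toy verdicts are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def asr_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of objectives with at least one successful attempt among the first k.

    `results` maps each objective to its list of judge verdicts, one per attempt.
    This definition is an assumption about how ASR@k is computed.
    """
    if not results:
        raise ValueError("results must contain at least one objective")
    hits = sum(1 for attempts in results.values() if any(attempts[:k]))
    return hits / len(results)


# Toy, made-up judge verdicts for three hypothetical objectives.
toy = {
    "objective_a": [False, False, True],
    "objective_b": [False, False, False],
    "objective_c": [True, False, False],
}

print(asr_at_k(toy, k=1))  # only objective_c succeeds on its first attempt
print(asr_at_k(toy, k=3))  # objective_a and objective_c succeed within three attempts
```

Under this reading, ASR@k can only grow as k increases, so an ASR@10 figure is an upper bound on the single-attempt success rate for the same setup.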

Paper Structure (26 sections, 9 figures, 1 table, 1 algorithm)

Figures (9)

  • Figure 1: High-level schematic of GOAT. Given a violating conversational objective and adversarial attack information, the attacker LLM is initialized and chained to converse with a target LLM. Upon each conversation turn, the attacker LLM is directed to reason through the reply using an "observation, thought, and strategy" structure. The final conversation is evaluated using an external judge.
  • Figure 2: Attack Success Rate by Target Model for Crescendo and GOAT. GOAT outperforms Crescendo across all models; its lowest result is on Llama 2 7B, with an ASR@10 of 85%.
  • Figure 3: Attack success rate broken down by conversation turns for Llama 3.1 8B (left) and GPT-4 (right).
  • Figure 4: Distribution of attacks chosen by GOAT in successful conversations against Llama 3.1 8B. The attacker model has the most success starting with a hypothetical scenario, but more evenly leverages attacks on the final turn.
  • Figure A.1: Attacker LLM system prompt that includes a general initialization as a red teaming assistant, the reasoning instructions, the conversation objective, and the in-context attack information placeholders.
  • ...and 4 more figures
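
The control flow described in the Figure 1 caption can be sketched as a small harness. This is an illustrative skeleton only, not the authors' implementation: every name here (`AttackerStep`, `Conversation`, `run_conversation`, and the stub callables) is hypothetical, the attacker, target, and judge are plain placeholder functions rather than real models, and the default of five turns is taken from the turn budget mentioned in the TL;DR.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (attacker message, target reply)


@dataclass
class AttackerStep:
    """One attacker turn: structured reasoning plus the message actually sent."""
    observation: str
    thought: str
    strategy: str
    message: str


@dataclass
class Conversation:
    objective: str
    turns: List[Turn] = field(default_factory=list)
    reasoning: List[AttackerStep] = field(default_factory=list)


Attacker = Callable[[str, List[Turn]], AttackerStep]
Target = Callable[[List[Turn], str], str]
Judge = Callable[[Conversation], bool]


def run_conversation(
    objective: str,
    attacker: Attacker,
    target: Target,
    judge: Judge,
    max_turns: int = 5,
) -> Tuple[Conversation, bool]:
    """Chain attacker and target for up to max_turns, then judge the transcript."""
    convo = Conversation(objective=objective)
    for _ in range(max_turns):
        # The attacker sees the objective and the history, including the last reply.
        step = attacker(objective, convo.turns)
        # The target sees only the messages, never the attacker's reasoning.
        reply = target(convo.turns, step.message)
        convo.reasoning.append(step)
        convo.turns.append((step.message, reply))
    # The external judge evaluates the final conversation as a whole.
    return convo, judge(convo)


# Placeholder stubs standing in for the three models.
def stub_attacker(objective: str, turns: List[Turn]) -> AttackerStep:
    n = len(turns) + 1
    return AttackerStep(
        observation=f"placeholder observation {n}",
        thought="placeholder thought",
        strategy="placeholder strategy",
        message=f"placeholder message {n}",
    )


def stub_target(turns: List[Turn], message: str) -> str:
    return f"placeholder reply to: {message}"


def stub_judge(convo: Conversation) -> bool:
    return False


convo, violating = run_conversation(
    "placeholder objective", stub_attacker, stub_target, stub_judge
)
print(len(convo.turns), violating)
```

Keeping the reasoning fields separate from the sent message mirrors the caption's point that the "observation, thought, and strategy" structure guides the attacker without being shown to the target.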