
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu

TL;DR

This paper introduces AutoDefense, a multi-agent defense framework that uses response-filtering to mitigate jailbreak attacks on LLMs. By decomposing defenses into input, defense agency, and output roles and exploring configurations from one to three agents, AutoDefense demonstrates improved robustness across diverse jailbreak methods and victim models while preserving normal functionality. Experimental results show substantial reductions in attack success rates (e.g., down to 7.95% with a three-agent system on GPT-3.5 using LLaMA-2-13B as defense) and favorable false-positive rates, supporting model-agnostic applicability. The framework remains extensible, allowing integration of other defenses (e.g., Llama Guard) and scalable deployment on larger LLMs, with acceptable overhead.

Abstract

Despite extensive pre-training in moral alignment to prevent generating harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division of tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks while maintaining performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.
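
The response-filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: `filter_response`, `toy_judge`, the `REFUSAL` text, and the `[unsafe]` marker are hypothetical names introduced only for illustration, and the judge is a trivial stand-in for the LLM-based defense agency. The point is only that the filter inspects the generated response, not the user prompt, so the same wrapper applies regardless of which victim model or jailbreak prompt produced the response.

```python
from typing import Callable

# Hypothetical refusal text; the actual wording used by AutoDefense may differ.
REFUSAL = "I'm sorry, but I can't assist with that request."


def filter_response(response: str, is_valid: Callable[[str], bool]) -> str:
    """Return the response unchanged if judged valid, else an explicit refusal."""
    return response if is_valid(response) else REFUSAL


def toy_judge(response: str) -> bool:
    """Toy stand-in for the defense agency: flags responses carrying a marker.

    A real judge would be one or more LLM agents classifying the response.
    """
    return "[unsafe]" not in response.lower()


if __name__ == "__main__":
    print(filter_response("Here is a pasta recipe.", toy_judge))
    print(filter_response("[UNSAFE] placeholder for harmful text", toy_judge))
```

Because the judge is passed in as a callable, a single-agent, two-agent, or three-agent defense agency could be swapped in without changing the wrapper.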

Paper Structure (24 sections, 5 figures, 18 tables)


Figures (5)

  • Figure 1: Example of AutoDefense against a jailbreak attack. In this example, to get the targeted answer from an LLM assistant without being refused, the user constructs a jailbreak prompt using refusal suppression. Before the generated response is presented to the user, it is first sent to AutoDefense. Whenever our defense determines the response to be invalid, it overrides the response with an explicit refusal.
  • Figure 2: Detailed design of the Defense Agency with respect to different numbers of LLM agents. The defense agency is responsible for completing the specific defense task by a multi-agent system. After the defense agency receives the LLM response from the input agent as shown in Figure 1, the defense agency will classify it as valid or invalid. In the single-agent setting on the left, one LLM agent will finish all the analysis tasks and give the judgment. In the two-agent and three-agent settings, agents collaboratively finish the defense task. There is a coordinator agent in the configuration that is responsible for controlling the high-level progress of the defense task.
  • Figure 3: Defense performance (ASR and FPR) for different numbers of agents, with each configuration evaluated 5 times on the curated dataset of harmful requests and the GPT-4-generated dataset of regular requests.
  • Figure 4: Prompt design for the multi-agent defense task agency. The upper part of the figure shows a CoT procedure for classifying whether a given system input is valid or invalid. Inspired by the CoT procedure, we can separate each step of the CoT and assign the tasks to different agents.
  • Figure 5: Defense performance (ASR and FPR) for different defense LLM configurations, with each configuration evaluated 10 times on the curated dataset of harmful requests and the GPT-4-generated dataset of regular requests. The defense results in this figure were obtained using the three-agent configuration.
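
The two metrics named in the captions above, ASR and FPR, can be computed as simple fractions. The sketch below assumes definitions that this page does not spell out: ASR as the fraction of harmful requests whose harmful response still reaches the user, and FPR as the fraction of regular requests whose response is wrongly overridden. The function names and the toy inputs are hypothetical; only the 55.74% and 7.95% figures come from the abstract.

```python
from typing import Sequence


def attack_success_rate(harmful_reached_user: Sequence[bool]) -> float:
    """Fraction of harmful requests whose harmful response was not blocked (assumed definition)."""
    if not harmful_reached_user:
        raise ValueError("need at least one harmful request")
    return sum(harmful_reached_user) / len(harmful_reached_user)


def false_positive_rate(regular_blocked: Sequence[bool]) -> float:
    """Fraction of regular requests whose response was wrongly overridden (assumed definition)."""
    if not regular_blocked:
        raise ValueError("need at least one regular request")
    return sum(regular_blocked) / len(regular_blocked)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy outcomes, not data from the paper.
    print(attack_success_rate([True, False, False, False]))
    print(false_positive_rate([False, False, True, False, False]))
    # ASR figures reported in the abstract for GPT-3.5 with a 3-agent LLaMA-2-13b defense.
    print(round(relative_reduction(55.74, 7.95), 4))
```

Applied to the reported figures, the drop from 55.74% to 7.95% is 47.79 percentage points, or a relative reduction of roughly 86%.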