Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game

Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li

TL;DR

The paper tackles the vulnerability of traditional safety defenses in LLMs, where outright rejection can be exploited by attackers. It introduces a multi-agent attacker–disguiser framework that leverages in-context learning and curriculum learning to produce safe yet disguised responses, framed as a zero-sum game optimized via Minimax Q-learning. Four roles—attacker, disguiser, safety evaluator, and disguise evaluator—interact cyclically to generate and assess enhanced samples, converging to a Nash equilibrium where defense intent remains concealed while safety is maintained. Experimental results on GPT-3.5 and GPT-4 show higher proportions of disguised, safe responses compared to baselines, demonstrating robustness and adaptability to black-box models without modifying underlying parameters.

Abstract

As large models improve on natural language processing tasks, moral and ethical concerns about them arise. Malicious attackers use techniques such as prompt engineering to jailbreak large models and induce them to generate illegal or privacy-invasive information. In response, large models counter these attacks using techniques such as safety alignment. However, a strong defense mechanism that relies on rejection replies is easily identified by attackers and can be used to strengthen their capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both reply safely to the attacker and hide the defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, in which agents play different roles responsible for attack, disguise, safety evaluation, and disguise evaluation. Next, we design attack and disguise game algorithms to optimize the game strategies of the attacker and the disguiser, and use a curriculum learning process to strengthen the agents' capabilities. Experiments verify that our method is more effective than other methods at strengthening the model's ability to disguise the defense intent. Moreover, our approach can adapt to any black-box large model to assist it in defense and is not affected by model version iterations.

Paper Structure

This paper contains 27 sections, 3 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: General illustration of our method. We construct a multi-agent framework consisting of an attacker, a disguiser, a safety evaluator, and a disguise evaluator to simulate attack and defense scenarios. The attacker and the disguiser generate the attack sample set and the disguise sample set, respectively, through in-context learning. Then, based on the reward feedback given by the evaluators, each of them plays the game to select a new round of enhanced samples.
  • Figure 2: Comparison of the normal security response mechanism and the response mechanism that disguises the defense intent. Panel (a), on the left, shows the normal security response, which defends by rejection. This type of response is easily detected by the attacker and strengthens the attacker's capabilities. Panel (b), on the right, shows a safe response that disguises the defense intent and can confuse the attacker.
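
The sample-selection step described in the Figure 1 caption can be illustrated with a small zero-sum matrix game. The sketch below is only an illustration under stated assumptions, not the paper's algorithm: the reward matrix is made up, the function names `maximin_row` and `minimax_col` are hypothetical, and it considers pure strategies only, whereas the paper optimizes the game with Minimax Q-learning, whose details are not given in this section. Rows stand for candidate disguise samples and columns for candidate attack samples; each entry is the disguiser's reward, which in a zero-sum game is the attacker's loss.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def maximin_row(payoff: Matrix) -> Tuple[int, float]:
    """Row player (disguiser): pick the row whose worst-case reward is largest."""
    worst_per_row = [min(row) for row in payoff]
    best = max(range(len(payoff)), key=lambda i: worst_per_row[i])
    return best, worst_per_row[best]


def minimax_col(payoff: Matrix) -> Tuple[int, float]:
    """Column player (attacker): pick the column whose worst-case loss is smallest."""
    n_cols = len(payoff[0])
    worst_per_col = [max(row[j] for row in payoff) for j in range(n_cols)]
    best = min(range(n_cols), key=lambda j: worst_per_col[j])
    return best, worst_per_col[best]


# Hypothetical rewards (assumption, not data from the paper):
# rows = candidate disguise samples, columns = candidate attack samples.
payoff = [
    [0.2, 0.6, 0.4],
    [0.5, 0.7, 0.6],
    [0.4, 0.1, 0.3],
]

disguise_idx, disguiser_value = maximin_row(payoff)
attack_idx, attacker_value = minimax_col(payoff)

# When both values coincide, the pair of choices is a saddle point,
# i.e. a pure-strategy Nash equilibrium of this matrix game.
is_equilibrium = disguiser_value == attacker_value
print(disguise_idx, attack_idx, is_equilibrium)
```

In this toy matrix the two values coincide, so neither side can gain by deviating alone. In general a zero-sum game may have no pure-strategy saddle point, and mixed strategies are then required.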