Defending Jailbreak Prompts via In-Context Adversarial Game

Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang

TL;DR

The paper addresses the vulnerability of LLMs to jailbreak prompts and the limitations of static defenses that rely on fine-tuning or fixed datasets. It proposes ICAG, an in-context adversarial game with attack and defense agents that iteratively generate and refine jailbreak prompts and safety prompts without retraining the model. ICAG achieves significant reductions in jailbreak success across ten unseen attack types and demonstrates transferability of defenses to other LLMs. The method converges within a few iterations, maintains general helpfulness on benchmarks like MMLU, and offers a transferable, dynamic defense paradigm with acknowledged limitations and avenues for future multimodal extensions.

Abstract

Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This continuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism.
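The iterative process described above can be sketched as a small loop. This is only an illustrative toy under stated assumptions, not the authors' implementation: the names `adversarial_game`, `is_jailbroken`, `refine_attack`, and `refine_defense` are hypothetical, and the LLM-based attack and defense agents are replaced by simple string rules so that the example runs on its own.

```python
from typing import Callable, List, Tuple


def adversarial_game(
    seed_attacks: List[str],
    is_jailbroken: Callable[[str, str], bool],
    refine_attack: Callable[[str], str],
    refine_defense: Callable[[str, List[str]], str],
    max_iters: int = 5,
) -> Tuple[str, List[float]]:
    """Alternate defense and attack refinement (hypothetical sketch).

    Returns the final safety prompt and the jailbreak success rate (JSR)
    measured at the start of each iteration.
    """
    if not seed_attacks:
        raise ValueError("seed_attacks must not be empty")
    safety_prompt = ""
    attacks = list(seed_attacks)
    jsr_history: List[float] = []
    for _ in range(max_iters):
        # Evaluate the current attacks against the current safety prompt.
        successes = [a for a in attacks if is_jailbroken(safety_prompt, a)]
        jsr_history.append(len(successes) / len(attacks))
        if not successes:
            break  # no attack succeeds any more
        # Defense step: update the safety prompt using the successful attacks.
        safety_prompt = refine_defense(safety_prompt, successes)
        # Attack step: rewrite every attack that the new prompt now blocks.
        attacks = [
            a if is_jailbroken(safety_prompt, a) else refine_attack(a)
            for a in attacks
        ]
    return safety_prompt, jsr_history


# Toy stand-ins (assumptions): an "attack" is just the name of a tactic,
# and the safety prompt blocks a tactic once it mentions that name.
TACTICS = ["roleplay", "encoding", "hypothetical"]


def toy_is_jailbroken(safety_prompt: str, attack: str) -> bool:
    return attack not in safety_prompt.split()


def toy_refine_attack(attack: str) -> str:
    # Move to the next tactic in the list; stay put at the last one.
    i = TACTICS.index(attack)
    return TACTICS[min(i + 1, len(TACTICS) - 1)]


def toy_refine_defense(safety_prompt: str, successes: List[str]) -> str:
    covered = safety_prompt.split()
    for s in successes:
        if s not in covered:
            covered.append(s)
    return " ".join(covered)


safety_prompt, jsr_history = adversarial_game(
    ["roleplay", "encoding"],
    toy_is_jailbroken,
    toy_refine_attack,
    toy_refine_defense,
)
print(safety_prompt)
print(jsr_history)
```

In this toy run the JSR falls from one iteration to the next because the defense accumulates every tactic that succeeded. In a real setting each callable would wrap LLM calls, and whether the loop converges would have to be checked empirically.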

Paper Structure (37 sections, 3 figures, 18 tables)

Figures (3)

  • Figure 1: Comparison between our proposed ICAG and Self Reminder from xie2023defending. (a) Self Reminder performs a single round of reasoning and prompt refinement for defense. (b) Our approach runs iterative attack and defense cycles, extracting more insights for both attacking and defending.
  • Figure 2: The overall workflow of In-Context Adversarial Game.
  • Figure 3: Change in the Jailbreak Success Rate (JSR) of ICAG over iterations on the validation set.