Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

Tong Liu; Yingjie Zhang; Zhe Zhao; Yinpeng Dong; Guozhu Meng; Kai Chen

TL;DR

The paper identifies a latent bias in LLM safety fine-tuning: harmful content is treated differently depending on whether it appears in the query or in the completion. It formalizes the resulting vulnerability and proposes the Disguise and Reconstruction Attack (DRA), a black-box jailbreak that disguises harmful instructions and prompts the model to reconstruct them in its completion, aided by context manipulation. Empirical results show DRA achieving state-of-the-art jailbreak success rates across multiple open-source and closed-source models, including 91.1% on GPT-4, while remaining efficient in query usage and robust to several defenses. The work also analyzes mitigations, including unbiased training and stronger system prompts, and discusses ethical considerations and the future defenses needed to bolster LLM safety. Overall, the study exposes a critical flaw in current safety alignment workflows and offers a concrete, transferable attack framework with clear implications for defense research.

Abstract

In recent years, large language models (LLMs) have demonstrated notable success across various tasks, but their trustworthiness remains an open problem. One specific threat is the potential to generate toxic or harmful responses: attackers can craft adversarial prompts that induce such responses from LLMs. In this work, we pioneer a theoretical foundation in LLM security by identifying bias vulnerabilities within safety fine-tuning, and we design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA achieves a 91.1% attack success rate on the OpenAI GPT-4 chatbot.

Paper Structure (30 sections, 10 equations, 5 figures, 11 tables, 2 algorithms)

Introduction
Background & Problem Statement
Large Language Models
LLM Jailbreak
Safety Alignment of LLM
Problem Statement
Safety Biases in LLM Fine-Tuning and the Resultant Vulnerability
Dialog Modeling and its Discrepancy
Fine-Tuning and its Safety Biases
Formal Definition of the Vulnerability
Approach
Harmful Instruction Disguise
Payload Reconstruction
Context Manipulation
Evaluation
...and 15 more sections

Figures (5)

Figure 1: DRA "disguise" + "reconstruction" jailbreak pipeline overview.
Figure 2: An example of puzzle-based obfuscation to disguise the harmful text "rob".
Figure 3: An example of word-level splitting of "How to rob a bank vault" with two cutoff strategies. After word-level splitting, the input question is "Ho to ro a nk vau lt", where P stands for no split.
Figure 4: Differential log-perplexities of harmful instructions.
Figure 5: Distribution of differential log-perplexity of LLAMA-2-13B-Chat's responses to harmful instructions with varied inducing prefixes. Cooperation and declination are denoted as "Coo." and "Dec." respectively in the plot legends, while the inducing prefixes are presented above each subplot.
