Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation

Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Yuwei Li, Chengkun Wei, Wenzhi Chen

TL;DR

This work introduces Dialogue Injection Attack (DIA), a new black-box jailbreak paradigm that exploits LLM chat templates and dialogue history to induce harmful outputs. DIA comprises two methods, DIA-I (refined prefilling with system prompts, hypnosis, affirmative beginnings, and answer guidance) and DIA-II (deferral via word-substitution with benign demonstrations), along with a promptRewrite optimization to sustain effectiveness across multi-turn prompts. Empirical results show DIA achieving state-of-the-art attack success on recent models (e.g., Llama-3.1-8B and GPT-4o) and bypassing multiple defenses, with substantial gains as the number of queries increases. The findings highlight a critical need for defenses that account for multi-turn dialogue contexts and template dynamics to mitigate historical-dialogue-driven jailbreaks in real-world LLM deployments.

Abstract

Large language models (LLMs) have demonstrated significant utility in a wide range of applications; however, their deployment is plagued by security vulnerabilities, notably jailbreak attacks. These attacks manipulate LLMs to generate harmful or unethical content by crafting adversarial prompts. While much of the current research on jailbreak attacks has focused on single-turn interactions, it has largely overlooked the impact of historical dialogues on model behavior. In this paper, we introduce a novel jailbreak paradigm, Dialogue Injection Attack (DIA), which leverages the dialogue history to enhance the success rates of such attacks. DIA operates in a black-box setting, requiring only access to the chat API or knowledge of the LLM's chat template. We propose two methods for constructing adversarial historical dialogues: one adapts gray-box prefilling attacks, and the other exploits deferred responses. Our experiments show that DIA achieves state-of-the-art attack success rates on recent LLMs, including Llama-3.1 and GPT-4o. Additionally, we demonstrate that DIA can bypass 5 different defense mechanisms, highlighting its robustness and effectiveness.

Paper Structure

This paper contains 25 sections, 8 equations, 10 figures, 6 tables, and 3 algorithms.

Figures (10)

  • Figure 1: LLM chat pipeline.
  • Figure 2: Evaluation results of chat template inference attack. We test on three LLMs, using two of their chat templates to construct probes.
  • Figure 3: Dialogue construction processes of $\mathsf{DIA}$-I and $\mathsf{DIA}$-II. This figure illustrates the composition of $\mathsf{DIA}$-I and $\mathsf{DIA}$-II dialogues on the left and right sides, respectively. $\mathsf{DIA}$-I prompts the victim LLM to continue its previous affirmative response, while $\mathsf{DIA}$-II engages the victim LLM in a word substitution task, subsequently answering prompts with the substituted words. $\mathsf{DIA}$-I utilizes an Affirmative Beginning Generation Module (ABGM) to automate the creation of affirmative beginnings of malicious responses. $\mathsf{DIA}$-II employs a Similar Demonstration Generation Module (SDGM) to generate benign demonstrations.
  • Figure 4: Log-likelihood distribution of affirmative beginnings with and without prepended benign text. Prepending benign text makes harmful content more likely to appear.
  • Figure 5: Multi-query attack results on AdvBench.
  • ...and 5 more figures