Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles

Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Hui Li

TL;DR

The paper tackles security vulnerabilities in large language models (LLMs) by examining multi-turn semantic jailbreaks and identifying context as a critical enabling factor. It introduces Context Fusion Attack (CFA), a three-stage, context-driven framework that extracts keywords, builds contextual scenarios, and integrates targets to elicit harmful outputs while concealing malicious intent in a black-box setting. Empirical results across three datasets and six models show CFA achieving higher attack success, lower semantic deviation, and greater harmfulness than existing multi-turn baselines, with notable strengths on Llama3 and GPT-4. These findings highlight the need for defenses that address context-based, dynamic loading mechanisms in multi-turn interactions and advance the understanding of how contextual manipulation can undermine LLM security.

Abstract

Large language models (LLMs) have significantly enhanced the performance of numerous applications, from intelligent conversations to text generation. However, their inherent security vulnerabilities have become an increasingly significant challenge, especially with respect to jailbreak attacks. Attackers can circumvent the security mechanisms of these LLMs, breaching security constraints and causing harmful outputs. Focusing on multi-turn semantic jailbreak attacks, we observe that existing methods lack specific consideration of the role of multi-turn dialogues in attack strategies, leading to semantic deviations during continuous interactions. Therefore, in this paper, we establish a theoretical foundation for multi-turn attacks by considering the support they provide in jailbreak attacks and, based on this, propose a context-based black-box jailbreak attack method named Context Fusion Attack (CFA). This approach involves filtering and extracting key terms from the target, constructing contextual scenarios around these terms, dynamically integrating the target into the scenarios, and replacing malicious key terms within the target, thereby concealing the direct malicious intent. Through comparisons on various mainstream LLMs and red team datasets, we demonstrate that CFA achieves a higher success rate, lower semantic divergence, and greater harmfulness than other multi-turn attack strategies, with particularly significant advantages on Llama3 and GPT-4.

Paper Structure

This paper contains 12 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison of jailbreak attacks: Multi-turn attacks generate multiple rounds of questions around the target.
  • Figure 2: Illustration of CFA. CFA consists of three stages: (1) Preprocess, where malicious keywords are filtered and extracted; (2) Context Generation, which generates multi-turn contexts based on these keywords; and (3) Target Trigger, where contextual scenarios are integrated and malicious keywords are strategically replaced to dynamically trigger attacks while reducing overt maliciousness, thereby evading the security mechanisms of LLMs.
  • Figure 3: The pipeline of CFA.
  • Figure 4: Examples of keywords in a malicious question. Keywords in green are semantically irrelevant and can be directly removed, while keywords in red, which are semantically relevant, are extracted for generating context.
  • Figure 5: Example of a context generation prompt.
  • ...and 2 more figures