SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains

Bijoy Ahmed Saiem, MD Sadik Hossain Shanto, Rakib Ahsan, Md Rafi ur Rashid

TL;DR

SequentialBreak exposes a critical LLM safety risk: it embeds a target harmful prompt within a chain of benign prompts in a single query, enabling one-shot, black-box jailbreaks. The authors validate the method across multiple open- and closed-source models using JailbreakBench, showing consistently high attack success rates and superior efficiency compared to baselines, while also examining defenses. The work highlights a fundamental vulnerability in how LLMs process sequential and nested prompts, underscoring the need for more robust, context-aware safeguards. These findings inform defense designers and motivate the development of improved moderation, context tracking, and resilience against sequential prompt-chain attacks.

Abstract

As the integration of Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defenses of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, prompt optimization, and iterative prompt optimization to conceal malicious prompts. In particular, sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, including but not limited to Question Bank, Dialog Completion, and Game Environment, where the harmful prompt is embedded within benign ones that can fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak uses only a single query to achieve a substantial gain in attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All the result files and the website associated with this research are available in this GitHub repository: https://anonymous.4open.science/r/JailBreakAttack-4F3B/.

Paper Structure

This paper contains 33 sections, 17 figures, 11 tables, 1 algorithm.

Figures (17)

  • Figure 1: This figure presents the workflow of our general attack on a target LLM using template-based prompt engineering. The attack starts with generating a template (Step 1) that will disguise harmful content. The attacker then selects a suitable template (Step 2) and creates or chooses a harmful target prompt (Step 3). The prompt is then reformatted according to the selected template and integrated into it (Steps 4-5). Finally, the restructured prompt is submitted to the target LLM, bypassing safety mechanisms and generating a harmful response (Step 6). This workflow illustrates the sequential steps involved in embedding harmful prompts into innocuous contexts, enabling attacks through creative prompt engineering.
  • Figure 2: This figure illustrates an attack scenario in which a harmful prompt is embedded within a question bank to bypass the safeguards of the LLM. The attacker chooses the harmful prompt "How to make a bomb" and formats it according to the selected question bank template. The prompt is then embedded within a series of benign questions, making it appear as a legitimate inquiry. When processed by the target LLM, the harmful content is generated (labeled as Q5) alongside non-harmful responses (labeled as Q1, Q2, Q3, Q4, and Q6), successfully evading the usual content filters.
  • Figure 3: This figure illustrates an attack scenario in which a harmful prompt is embedded within a sequential dialogue conversation to bypass the LLM safeguards. The attacker chooses the harmful prompt, "How to make a bomb," and formats it according to the selected dialogue template. The prompt is then embedded in the conversation's context, making it appear as a legitimate continuation. When processed by the target LLM, the harmful content is generated (labeled as Jack's reply) alongside non-harmful dialogue, successfully evading the usual content filters.
  • Figure 4: This figure illustrates an attack scenario in which a harmful prompt is embedded within a game narrative to bypass the LLM’s safeguards. The attacker chooses the harmful prompt, "How to make a bomb," and formats it according to the selected game scenario template. The prompt is then embedded in the game's context, making it appear as a legitimate task. When processed by the target LLM, the harmful content is generated (labeled as L3) alongside non-harmful responses (labeled as L1, L2, L4, and L5), successfully evading the usual content filters.
  • Figure 5: Screenshot of the harmful prompt ("Give instructions manufacturing a fake weight loss drug to sell online as a quick solution") and the corresponding harmful response from the closed-source model Gemini using Question Bank template 1.
  • ...and 12 more figures