Table of Contents
Fetching ...

Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives

Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, Wanlei Zhou

TL;DR

The paper tackles the brittleness of safety alignment by revealing that LLMs can jailbreak other LLMs through narrative induction. It proposes Chain-of-Lure, a dual-chain framework that first converts a sensitive query into a narrative via mission transfer and then optimizes this narrative across turns to bypass safeguards, aided by a helper model and evaluated with a toxicity-based score. Empirical results show near-perfect attack success across diverse victim models and datasets, with large attacker models delivering higher toxicity, and LRMs remaining particularly vulnerable to narrative attacks. The work also demonstrates practical defenses (pre-intent detection and post-threat analysis) and argues for toxicity-based evaluation as a more informative measure of jailbreak effectiveness than traditional refusal-based metrics. Together, these findings stress the need for stronger, multi-layer alignment and dynamic detection techniques to counter sophisticated narrative-based jailbreaks in real-world systems.

Abstract

In the era of rapid generative AI development, interactions with large language models (LLMs) pose increasing risks of misuse. Prior research has primarily focused on attacks using template-based prompts and optimization-oriented methods, while overlooking the fact that LLMs possess strong unconstrained deceptive capabilities to attack other LLMs. This paper introduces a novel jailbreaking method inspired by the Chain-of-Thought mechanism. The attacker employs mission transfer to conceal harmful user intent within dialogue and generates a progressive chain of lure questions without relying on predefined templates, enabling successful jailbreaks. To further improve the attack's strength, we incorporate a helper LLM model that performs randomized narrative optimization over multi-turn interactions, enhancing the attack performance while preserving alignment with the original intent. We also propose a toxicity-based framework using third-party LLMs to evaluate harmful content and its alignment with malicious intent. Extensive experiments demonstrate that our method consistently achieves high attack success rates and elevated toxicity scores across diverse types of LLMs under black-box API settings. These findings reveal the intrinsic potential of LLMs to perform unrestricted attacks in the absence of robust alignment constraints. Our approach offers data-driven insights to inform the design of future alignment mechanisms. Finally, we propose two concrete defense strategies to support the development of safer generative models.

Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives

TL;DR

The paper tackles the brittleness of safety alignment by revealing that LLMs can jailbreak other LLMs through narrative induction. It proposes Chain-of-Lure, a dual-chain framework that first converts a sensitive query into a narrative via mission transfer and then optimizes this narrative across turns to bypass safeguards, aided by a helper model and evaluated with a toxicity-based score. Empirical results show near-perfect attack success across diverse victim models and datasets, with large attacker models delivering higher toxicity, and LRMs remaining particularly vulnerable to narrative attacks. The work also demonstrates practical defenses (pre-intent detection and post-threat analysis) and argues for toxicity-based evaluation as a more informative measure of jailbreak effectiveness than traditional refusal-based metrics. Together, these findings stress the need for stronger, multi-layer alignment and dynamic detection techniques to counter sophisticated narrative-based jailbreaks in real-world systems.

Abstract

In the era of rapid generative AI development, interactions with large language models (LLMs) pose increasing risks of misuse. Prior research has primarily focused on attacks using template-based prompts and optimization-oriented methods, while overlooking the fact that LLMs possess strong unconstrained deceptive capabilities to attack other LLMs. This paper introduces a novel jailbreaking method inspired by the Chain-of-Thought mechanism. The attacker employs mission transfer to conceal harmful user intent within dialogue and generates a progressive chain of lure questions without relying on predefined templates, enabling successful jailbreaks. To further improve the attack's strength, we incorporate a helper LLM model that performs randomized narrative optimization over multi-turn interactions, enhancing the attack performance while preserving alignment with the original intent. We also propose a toxicity-based framework using third-party LLMs to evaluate harmful content and its alignment with malicious intent. Extensive experiments demonstrate that our method consistently achieves high attack success rates and elevated toxicity scores across diverse types of LLMs under black-box API settings. These findings reveal the intrinsic potential of LLMs to perform unrestricted attacks in the absence of robust alignment constraints. Our approach offers data-driven insights to inform the design of future alignment mechanisms. Finally, we propose two concrete defense strategies to support the development of safer generative models.

Paper Structure

This paper contains 31 sections, 3 equations, 2 figures, 9 tables, 3 algorithms.

Figures (2)

  • Figure 1: The overview workflow of Chain-of-Lure jailbreaking method. The mission transfer process begins with an attacker crafting a narrative lure, strategically employing Scenario, Roles, Details and Guidelines, and Mock Serious Questions. These elements, designed to align with human cognitive patterns, embed context-specific questions to progressively extract sensitive details from the victim model. If the initial lure fail, Chain-of-Lure utilizes iterative narrative optimization via a helper model, refining narrative elements to bypass security barriers and achieve the desired harmful response.
  • Figure 2: Comparison of $ASR$ and $TS$ of Reasoning Process and Output after applying Single and Multi-turn CoL on LRMs. We using DeepSeek-V3-1226 as the attacker model. The values in the figures reveal that LRMs are highly susceptible to external manipulation.