Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives
Wenhan Chang, Tianqing Zhu, Yu Zhao, Shuangyong Song, Ping Xiong, Wanlei Zhou
TL;DR
The paper tackles the brittleness of safety alignment by showing that LLMs can jailbreak other LLMs through narrative induction. It proposes Chain-of-Lure, a dual-chain framework that first converts a sensitive query into a narrative via mission transfer and then optimizes this narrative across turns to bypass safeguards, aided by a helper model and evaluated with a toxicity-based score. Empirical results show near-perfect attack success across diverse victim models and datasets, with larger attacker models producing higher toxicity scores and large reasoning models (LRMs) remaining particularly vulnerable to narrative attacks. The work also demonstrates two practical defenses (pre-intent detection and post-threat analysis) and argues that toxicity-based evaluation is a more informative measure of jailbreak effectiveness than traditional refusal-based metrics. Together, these findings stress the need for stronger, multi-layer alignment and dynamic detection techniques to counter sophisticated narrative-based jailbreaks in real-world systems.
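To make the contrast between refusal-based and toxicity-based evaluation concrete, the following is a minimal sketch, not the paper's evaluation code. The `JudgedResponse` record, the function names, and the 1-to-5 score scale are all assumptions introduced here for illustration; the scores are made-up placeholders, not results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgedResponse:
    """One victim-model response after judging (hypothetical record type)."""
    refused: bool    # refusal-based view: did the model decline to answer?
    toxicity: float  # judge-assigned harmfulness score; the 1-5 scale is an assumption


def refusal_based_asr(records: List[JudgedResponse]) -> float:
    """Fraction of responses that were not refusals."""
    if not records:
        return 0.0
    return sum(not r.refused for r in records) / len(records)


def mean_toxicity(records: List[JudgedResponse]) -> float:
    """Average judge-assigned toxicity over all responses."""
    if not records:
        return 0.0
    return sum(r.toxicity for r in records) / len(records)


# Placeholder data: three of four responses are non-refusals,
# but only one of them is actually judged harmful.
records = [
    JudgedResponse(refused=False, toxicity=1.0),  # answered, but harmless
    JudgedResponse(refused=False, toxicity=5.0),  # answered and harmful
    JudgedResponse(refused=True, toxicity=1.0),   # refused
    JudgedResponse(refused=False, toxicity=1.0),  # answered, but off-topic
]

asr = refusal_based_asr(records)
tox = mean_toxicity(records)
```

In this toy data the refusal-based rate counts three of four responses as successes, while the mean toxicity stays low because only one response is judged harmful. That gap is the kind of information a refusal-only metric cannot express.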
Abstract
In the era of rapid generative AI development, interactions with large language models (LLMs) pose increasing risks of misuse. Prior research has primarily focused on attacks using template-based prompts and optimization-oriented methods, while overlooking the fact that LLMs possess strong unconstrained deceptive capabilities to attack other LLMs. This paper introduces a novel jailbreaking method inspired by the Chain-of-Thought mechanism. The attacker employs mission transfer to conceal harmful user intent within dialogue and generates a progressive chain of lure questions without relying on predefined templates, enabling successful jailbreaks. To further improve the attack's strength, we incorporate a helper LLM that performs randomized narrative optimization over multi-turn interactions, enhancing attack performance while preserving alignment with the original intent. We also propose a toxicity-based framework using third-party LLMs to evaluate harmful content and its alignment with malicious intent. Extensive experiments demonstrate that our method consistently achieves high attack success rates and elevated toxicity scores across diverse types of LLMs under black-box API settings. These findings reveal the intrinsic potential of LLMs to perform unrestricted attacks in the absence of robust alignment constraints. Our approach offers data-driven insights to inform the design of future alignment mechanisms. Finally, we propose two concrete defense strategies to support the development of safer generative models.
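The two defense strategies named in the TL;DR, pre-intent detection and post-threat analysis, can be pictured as checks placed before and after generation. The sketch below is only an illustration of that layering under stated assumptions: `guarded_generate`, the checker signatures, and the `REFUSAL` string are hypothetical names introduced here, and the lambda checkers are trivial stand-ins for whatever detectors a real system would use.

```python
from typing import Callable

REFUSAL = "Request declined by safety layer."  # placeholder refusal message


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    intent_is_harmful: Callable[[str], bool],
    output_is_harmful: Callable[[str, str], bool],
) -> str:
    """Wrap a model call with an input-side and an output-side check.

    All three callables are assumptions supplied by the caller; this
    function only fixes the order in which they run.
    """
    # Layer 1 (pre-intent detection): inspect the prompt before the model sees it.
    if intent_is_harmful(prompt):
        return REFUSAL

    response = generate(prompt)

    # Layer 2 (post-threat analysis): inspect the produced text before returning it.
    if output_is_harmful(prompt, response):
        return REFUSAL

    return response


# Stand-in components for demonstration only.
echo_model = lambda p: f"echo: {p}"
never_flag_input = lambda p: False
flag_marked_output = lambda p, r: "UNSAFE" in r

safe_reply = guarded_generate("hello", echo_model, never_flag_input, flag_marked_output)
blocked_reply = guarded_generate("UNSAFE text", echo_model, never_flag_input, flag_marked_output)
```

The second call shows why the output-side layer matters: the input check passes, yet the response is still withheld. A narrative that hides its intent well enough to pass the first layer can still be caught once the generated content itself is examined.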
