Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles
Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu
TL;DR
This paper tackles the problem of jailbreaking safety-aligned LLMs in blackbox settings by exploiting self-attention through carefully crafted carrier articles that semantically accompany prohibited queries. The authors design an automated payload generation pipeline that constructs a hypernym article from WordNet, appends a query context, and injects the prohibited query at multiple insertion points to diffuse attention toward the malicious goal. Across JailbreakBench and four target models, the method achieves high success rates and generally outperforms prior blackbox approaches, with notable effects from carrier-topic alignment, article length, and decoding parameters. The work highlights a meaningful vulnerability in safety alignment, suggesting concrete considerations for defense, such as strengthening attention-aware safeguards and evaluating payload diversity in defensive testing.
Abstract
Large Language Model (LLM) jailbreak refers to a type of attack that aims to bypass the safeguards of an LLM so that it generates content inconsistent with its safe usage guidelines. Based on insights from the self-attention computation process, this paper proposes a novel blackbox jailbreak approach that crafts the payload prompt by strategically injecting the prohibited query into a carrier article. The carrier article, which maintains semantic proximity to the prohibited query, is automatically produced by combining a hypernym article and a context, both of which are generated from the prohibited query. The intuition behind using a carrier article is to activate the neurons in the model related to the semantics of the prohibited query while suppressing the neurons that would trigger the objectionable text. The carrier article itself is benign, and we leverage prompt injection techniques to produce the payload prompt. We evaluate our approach using JailbreakBench, testing against four target models across 100 distinct jailbreak objectives. The experimental results demonstrate our method's superior effectiveness, achieving an average success rate of 63% across all target models and significantly outperforming existing blackbox jailbreak methods.
