Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles

Zhilong Wang, Haizhou Wang, Nanqing Luo, Lan Zhang, Xiaoyan Sun, Yebo Cao, Peng Liu

TL;DR

This paper addresses blackbox jailbreaking of safety-aligned LLMs. It exploits the self-attention mechanism by embedding a prohibited query in a carrier article that is semantically close to it. The authors design an automated payload generation pipeline that builds a hypernym article using WordNet, adds a query context, and injects the prohibited query at multiple insertion points to diffuse the attention the model directs at the malicious goal. On JailbreakBench with four target models, the method achieves an average success rate of 63% and generally outperforms prior blackbox approaches; carrier-topic alignment, article length, and decoding parameters all have a notable effect on the results. The work exposes a meaningful vulnerability in safety alignment and points to concrete considerations for defense, such as strengthening attention-aware safeguards and evaluating payload diversity in defensive testing.

Abstract

A Large Language Model (LLM) jailbreak is a type of attack that aims to bypass the safeguards of an LLM so that it generates content inconsistent with its safe usage guidelines. Based on insights from the self-attention computation process, this paper proposes a novel blackbox jailbreak approach that crafts the payload prompt by strategically injecting the prohibited query into a carrier article. The carrier article, which stays semantically close to the prohibited query, is produced automatically by combining a hypernym article and a context, both of which are generated from the prohibited query. The intuition behind using a carrier article is to activate the neurons in the model related to the semantics of the prohibited query while suppressing the neurons that would trigger the objectionable text. The carrier article itself is benign, and we leverage prompt injection techniques to produce the payload prompt. We evaluate our approach using JailbreakBench, testing against four target models across 100 distinct jailbreak objectives. The experimental results demonstrate our method's superior effectiveness: it achieves an average success rate of 63% across all target models, significantly outperforming existing blackbox jailbreak methods.

Paper Structure (28 sections, 9 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Approach overview.
  • Figure 2: An example of our attacking payload. It begins with the query context (in gray color), followed by a carrier article (in black color) that contains the malicious query (in red color) embedded within it.
  • Figure 3: Searching for $n$-step hypernyms from the subject word in WordNet.
  • Figure 4: Cumulative number of successes for different numbers of attempts on the benchmark.
  • Figure 5: Comparison with other blackbox jailbreak attacks on the benchmark.
  • ...and 2 more figures