Jailbreaking LLMs via Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge
Ning Xu, Bo Gao, Hui Dou
TL;DR
The paper investigates jailbreak vulnerabilities in LLMs by focusing on semantically relevant nested scenarios that embed targeted toxic knowledge. It introduces RTS-Attack, a black-box framework that classifies harmful queries, generates adaptive nested scenarios via in-context learning, and customizes instructions to provoke harmful responses, all within three interaction rounds. Across six diverse LLMs, RTS-Attack achieves an average Harmfulness Score near 4.90 and an Attack Success Rate around 96.7%, while using only about 96 input tokens per attack, outperforming seven baselines. The work highlights the need for stronger alignment defenses and offers a universal framework for evaluating and advancing jailbreak research in real-world deployments.
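The two headline metrics above can be made concrete with a small sketch. It assumes each response receives an integer harmfulness score from 1 to 5 from a judge, and that an attack counts as successful only when the score reaches the maximum. Both the scale and the threshold are assumptions here, not definitions taken from this summary. The function name `summarize_scores` and the sample scores are hypothetical and exist purely for illustration.

```python
from statistics import mean


def summarize_scores(scores, success_threshold=5):
    """Return (average harmfulness score, attack success rate).

    Assumptions (not confirmed by the source text):
      - scores are integers on a 1-5 judge scale;
      - an attack is a success when its score >= success_threshold.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("scores are expected on a 1-5 scale")

    average = mean(scores)
    # Fraction of attempts whose score meets the success threshold.
    success_rate = sum(s >= success_threshold for s in scores) / len(scores)
    return average, success_rate


# Hypothetical judge scores for six attempts (illustrative only, not paper data).
example_scores = [5, 5, 4, 5, 1, 5]
avg_score, asr = summarize_scores(example_scores)
print(f"average harmfulness score: {avg_score:.2f}")
print(f"attack success rate: {asr:.1%}")
```

Under a strict threshold like this one, a high average score does not by itself imply a high success rate, which is why the two numbers are reported separately.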
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks. However, they remain exposed to jailbreak attacks that elicit harmful responses. The nested scenario strategy has been increasingly adopted across various methods and has shown considerable potential. Nevertheless, these methods are easily detected because their malicious intent is prominent. In this work, we are the first to find and systematically verify that LLMs' alignment defenses are not sensitive to nested scenarios that are highly semantically relevant to the queries and incorporate targeted toxic knowledge. This is a crucial yet insufficiently explored direction. Based on this finding, we propose RTS-Attack (Semantically Relevant Nested Scenarios with Targeted Toxic Knowledge), an adaptive and automated framework to examine LLMs' alignment. By building scenarios highly relevant to the queries and integrating targeted toxic knowledge, RTS-Attack bypasses the alignment defenses of LLMs. Moreover, the jailbreak prompts generated by RTS-Attack are free from harmful queries, which makes them well concealed. Extensive experiments demonstrate that RTS-Attack exhibits superior performance in both efficiency and universality compared to the baselines across diverse advanced LLMs, including GPT-4o, Llama3-70b, and Gemini-pro. Our complete code is available at https://github.com/nercode/Work. WARNING: THIS PAPER CONTAINS POTENTIALLY HARMFUL CONTENT.
