Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection

Zhilong Wang; Yebo Cao; Peng Liu

TL;DR

The paper addresses jailbreaks that can deceive both LLMs and human analysts by hiding malicious intent inside truthful content. It introduces logic-chain injection, a method that disassembles a malicious query into a chain of semantically equivalent narrations and distributes them through a related benign article, so that a transformer-based model can connect the dispersed logic via attention and produce the harmful output. Two concrete instantiations, Paragraphed Logic Chain and Acrostic Style Logic Chain, demonstrate how narrations can be scattered and chained within benign text. The authors contrast their approach with existing jailbreaks, arguing that it jointly targets model safeguards and human perception, and discuss reproducibility challenges due to defenses and model evolution. These findings underscore the need for robust defenses against stealthy jailbreak strategies that also evade human review in LLM systems.

Abstract

Jailbreak attacks on Large Language Models (LLMs) entail crafting prompts that exploit the models to generate malicious content. Existing jailbreak attacks can successfully deceive LLMs; however, they cannot deceive humans. This paper proposes a new type of jailbreak attack that can deceive both LLMs and humans (i.e., security analysts). The key insight is borrowed from social psychology: humans are easily deceived when a lie is hidden in truth. Based on this insight, we propose logic-chain injection attacks, which inject malicious intent into benign truth. A logic-chain injection attack first disassembles its malicious target into a chain of benign narrations, and then distributes the narrations into a related benign article composed of undoubted facts. In this way, the newly generated prompt can deceive not only LLMs but also humans.
