
Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen

Abstract

Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly. This renders many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise how the trade-off between the extent of obfuscation and input recoverability affects jailbreak success.

Paper Structure

This paper contains 42 sections, 10 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison between existing approaches and our proposed Structured Semantic Cloaking. Left panel: the original malicious query is transformed into a Base64 string. Although the semantic intent is hidden, the LLM recognises the pattern and rejects the query. Right panel: our S2C framework adopts multi-dimensional and deeper cloaking strategies, requiring multi-step reasoning and co-reference resolution to reconstruct the intent.
  • Figure 2: Per-sample logprob difference comparison between original and fragmented prompts. 'Original'/'Fragmented' refers to the logprob difference value $D(y^r, y^a)$ on the original/fragmented query, respectively. Points above the diagonal line indicate that the original prompt has a higher rejection rate than the corresponding fragmented prompt.
  • Figure 3: Overview of our S2C framework. The framework includes three components: contextual reframing, content fragmentation, and clue-guided camouflage. Contextual reframing rewrites the initial query into a scenario script. The script is fragmented into a redacted script and several key terms (e.g., sensitive words and noun/verb phrases). The key terms are transformed into neutral clues. The clue crafting method is sampled from the pool each time.
  • Figure 4: Distribution of Clue Crafting Methods in Successful Jailbreak Attacks. 'Char-N': Char Noise. 'Emo-N': Emoji Noise.
  • Figure 5: ASRs and RSRs of isolated single clue crafting methods across models on JBB-Behaviors. In each run, only one method is applied to all the queries. 'Char-N': Char Noise.
  • ...and 1 more figure