Beyond Prompts: Space-Time Decoupling Control-Plane Jailbreaks in LLM Structured Output

Shuoming Zhang, Jiacheng Zhao, Hanyuan Dong, Ruiyuan Xu, Zhicheng Li, Yangyu Zhang, Shuaijiang Li, Yuan Wen, Chunwei Xia, Zheng Wang, Xiaobing Feng, Huimin Cui

TL;DR

The paper identifies a critical control-plane vulnerability in LLM structured-output systems by showing that grammar constraints can be weaponized to bypass safety protections through Constrained Decoding Attacks (CDA). It introduces two concrete attacks, EnumAttack and DictAttack, and demonstrates their effectiveness across 13 models and five benchmarks, with DictAttack maintaining high jailbreak success even under layered guardrails. The work reveals a semantic gap in current defenses, highlighting the need for cross-plane auditing that analyzes both data-plane prompts and control-plane grammars, including cross-turn context. These findings have practical implications for deploying LLM-powered tooling and agents, urging the development of integrated security mechanisms that secure the entire generation pipeline rather than just the prompt or output boundaries.

Abstract

Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers.

Large Language Models (LLMs) are extensively used as tooling platforms through structured output APIs, which ensure syntax compliance and thereby enable robust integration with existing software such as agent systems. However, the very feature that enables grammar-guided structured output presents significant security vulnerabilities. In this work, we reveal a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a novel jailbreak class that weaponizes structured output constraints to bypass both external auditing and internal safety alignment. Unlike prior attacks focused on input prompt designs, CDA operates by embedding malicious intent in schema-level grammar rules (control plane) while maintaining benign surface prompts (data plane). We instantiate this with two proof-of-concept attacks: EnumAttack, which embeds malicious content in enum fields; and the more evasive DictAttack, which decouples the malicious payload across a benign prompt and a dictionary-based grammar.

Our evaluation spans a broad spectrum of 13 proprietary and open-weight models. In particular, DictAttack achieves 94.3–99.5% ASR across five benchmarks on gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b. Furthermore, we demonstrate the significant challenge in defending against these threats: while basic grammar auditing mitigates EnumAttack, the more sophisticated DictAttack maintains a 75.8% ASR even against multiple state-of-the-art jailbreak guardrails. This exposes a critical "semantic gap" in current safety architectures and underscores the urgent need for cross-plane defenses that can bridge the data and control planes to secure the LLM generation pipeline.
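To make the idea of auditing both planes together concrete, here is a minimal defensive sketch in Python. It is not the paper's method: the helper names `schema_strings` and `audit_request` are hypothetical, and `is_flagged` stands in for whatever moderation classifier a deployment already uses. It only covers the basic grammar auditing that the abstract says mitigates EnumAttack; by the abstract's own results, this kind of check should not be assumed sufficient against DictAttack.

```python
from typing import Any, Callable, Iterator


def schema_strings(node: Any) -> Iterator[str]:
    """Yield every string in a JSON-Schema-like structure (keys and values)."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from schema_strings(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from schema_strings(item)


def audit_request(prompt: str, schema: Any, is_flagged: Callable[[str], bool]) -> bool:
    """Return True if the prompt and schema text, audited jointly, are flagged.

    The prompt (data plane) and all schema strings (control plane) are joined
    into one text so the classifier sees both planes in a single pass.
    """
    joint_text = "\n".join([prompt, *schema_strings(schema)])
    return is_flagged(joint_text)


# Placeholder classifier (assumption): a real system would call a moderation model.
def toy_classifier(text: str) -> bool:
    return "FORBIDDEN_TOPIC" in text


benign_schema = {
    "type": "object",
    "properties": {"topic": {"type": "string", "enum": ["weather", "sports"]}},
}
suspicious_schema = {
    "type": "object",
    "properties": {"topic": {"type": "string", "enum": ["weather", "FORBIDDEN_TOPIC"]}},
}

print(audit_request("Fill in the form.", benign_schema, toy_classifier))      # False
print(audit_request("Fill in the form.", suspicious_schema, toy_classifier))  # True
```

The design choice worth noting is that the schema is never audited in isolation: a prompt-only guardrail would pass both requests above, since the prompt is identical and benign in each.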

Paper Structure

This paper contains 29 sections, 2 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: (1) Prompt-based data-plane jailbreak attack, mitigated by a guardrail; (2) EnumAttack, which uses structured output (the LLM control plane) to embed a malicious question and is currently not guarded; (3) DictAttack, which decouples the malicious payload into a benign prompt and a grammar, thereby jailbreaking systems that have guardrails on both planes.
  • Figure 2: Illustration of constrained decoding. At each step, a per-token mask is generated in a manner analogous to the lexer–parser workflow in compiler design: prior outputs are treated as a token stream, matched against grammar rules through a parsing process, and used to produce the mask. This mask is then applied during LLM decoding, ensuring the generated output conforms to the specified grammar.
  • Figure 3: Comparison of traditional data-plane jailbreak attempts versus control-plane attacks, illustrating how structured output constraints can be exploited to bypass safety mechanisms. While conventional refusal mechanisms effectively block direct harmful prompts (a), constrained decoding attacks can circumvent these protections by embedding malicious content within grammar specifications (b), resulting in successful jailbreaks that generate harmful content with detailed instructions (c).
  • Figure 4: Illustration of EnumAttack, where the malicious intent is embedded in the enum property of the JSON Schema (control plane), while the prompt (data plane) remains benign.
  • Figure 5: EnumAttack evaluation across different grammar backends and serving engines with llama-3.1-8b. Apart from negligible differences caused by temperature and minor backend variations, the attack is consistently successful.
  • ...and 4 more figures
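
The per-token masking described in the Figure 2 caption can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the vocabulary, the "grammar" (a fixed set of allowed strings rather than a real parser), and the constant toy logits. Production grammar backends compile full grammars and operate on real tokenizer vocabularies; this only shows how a mask overrides the model's preferences at each decoding step.

```python
import math
from typing import Callable, List

# Toy vocabulary and a toy "grammar": the output must be one of ALLOWED.
VOCAB = ["re", "d", "gre", "en", "blue", "bl", "ue", "yes", "<eos>"]
ALLOWED = ["red", "green", "blue"]


def token_mask(prefix: str) -> List[bool]:
    """Mark each vocabulary token as valid or invalid given the output so far."""
    mask = []
    for tok in VOCAB:
        if tok == "<eos>":
            # Stopping is valid only once a complete allowed string is emitted.
            mask.append(prefix in ALLOWED)
        else:
            candidate = prefix + tok
            mask.append(any(s.startswith(candidate) for s in ALLOWED))
    return mask


def constrained_greedy(logits_fn: Callable[[str], List[float]], max_steps: int = 8) -> str:
    """Greedy decoding where masked-out tokens receive a logit of -inf."""
    prefix = ""
    for _ in range(max_steps):
        logits = logits_fn(prefix)
        mask = token_mask(prefix)
        masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        if VOCAB[best] == "<eos>":
            break
        prefix += VOCAB[best]
    return prefix


def toy_logits(prefix: str) -> List[float]:
    """Stand-in for a model: strongly prefers "yes", which the grammar forbids."""
    return [0.1, 0.2, 0.5, 0.3, 0.4, 0.0, 0.0, 5.0, 0.05]


print(constrained_greedy(toy_logits))  # green
```

Although the toy model assigns by far the highest score to "yes", the mask removes it at every step, so the output is forced into the grammar. This is the property that makes the grammar a control plane: whoever supplies it determines which continuations are reachable, independently of the prompt.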