Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen
TL;DR
Constrained decoding for structured generation is bottlenecked by nondeterministic transitions when using LR(1) grammars. Pre$^3$ builds a deterministic pushdown automaton (DPDA) from the LR(1) state graph using prefix-conditioned edges and cycle-aware construction. This removes runtime path exploration and enables ahead-of-time edge optimizations, which in turn allow transitions to be processed in parallel. Key contributions are an algorithm that transforms LR(1) transition graphs into a DPDA, prefix-conditioned edge optimization, cycle-aware DPDA construction, and edge-merge techniques, all integrated with mainstream LLM inference frameworks. Empirically, Pre$^3$ reduces time per output token (TPOT) by up to 40% and increases throughput by up to 36%, with a preprocessing cost of 3–5 seconds for complex grammars and strong scalability to large batches, enabling efficient, deterministic structured generation in real-world deployments.
Abstract
A wide range of LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime execution overhead for context-dependent token processing and is especially inefficient under large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to optimize the efficiency of constrained LLM decoding. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into a DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
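To make the notion of a deterministic transition concrete, the following is a minimal sketch of a toy DPDA for nested braces and brackets. It is not the Pre$^3$ construction: the transition table, the single state `q`, and the helpers `step`, `allowed_symbols`, and `accepts` are hypothetical names chosen for illustration, and the input symbols are single characters rather than LLM tokens. It only shows the property the abstract relies on: each (state, input symbol, stack top) triple maps to at most one transition, so every step is a table lookup with no path exploration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Illustrative toy only: a hand-written DPDA, not one derived from an LR(1) graph.
Key = Tuple[str, str, str]            # (state, input symbol, stack top)
Action = Tuple[str, Optional[str]]    # ("push", symbol) or ("pop", None)

BOTTOM = "$"  # stack-bottom marker (an assumption of this sketch)

TRANSITIONS: Dict[Key, Tuple[str, Action]] = {}
for top in (BOTTOM, "{", "["):
    # An opening symbol may follow any stack top and is pushed.
    TRANSITIONS[("q", "{", top)] = ("q", ("push", "{"))
    TRANSITIONS[("q", "[", top)] = ("q", ("push", "["))
# A closing symbol is legal only when the matching opener is on top.
TRANSITIONS[("q", "}", "{")] = ("q", ("pop", None))
TRANSITIONS[("q", "]", "[")] = ("q", ("pop", None))


def step(state: str, stack: List[str], symbol: str) -> Optional[str]:
    """Apply the unique transition for (state, symbol, stack top), if any.

    Returns the next state, or None when the symbol is not allowed.
    The stack is modified in place.
    """
    entry = TRANSITIONS.get((state, symbol, stack[-1]))
    if entry is None:
        return None
    next_state, (op, arg) = entry
    if op == "push":
        stack.append(arg)
    elif op == "pop":
        stack.pop()
    return next_state


def allowed_symbols(state: str, stack: List[str]) -> Set[str]:
    """Symbols with a defined transition from the current configuration."""
    return {sym for (s, sym, top) in TRANSITIONS if s == state and top == stack[-1]}


def accepts(text: str) -> bool:
    """Run the DPDA over text; accept when the stack returns to the bottom marker."""
    state, stack = "q", [BOTTOM]
    for ch in text:
        nxt = step(state, stack, ch)
        if nxt is None:
            return False
        state = nxt
    return stack == [BOTTOM]
```

In a constrained-decoding setting, a set like the one returned by `allowed_symbols` plays the role of the mask over permitted next symbols. Because the lookup depends only on the current state and stack top, it can in principle be evaluated independently for each sequence in a batch.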
