Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen

TL;DR

Constrained decoding for structured generation with LR(1) grammars is bottlenecked by nondeterministic transitions that must be resolved at runtime. Pre$^3$ builds a deterministic pushdown automaton (DPDA) from the LR(1) state graph using prefix-conditioned edges and cycle-aware construction. This removes runtime path exploration and allows edges to be optimized ahead of time, so transitions can be processed in parallel. Key contributions are an algorithm that transforms an LR(1) transition graph into a DPDA, prefix-conditioned edge optimization, cycle-aware DPDA construction, and edge-merge techniques, all integrated with mainstream LLM inference frameworks. Empirically, Pre$^3$ reduces time per output token (TPOT) by up to 40% and raises throughput by up to 36%, with a preprocessing cost of 3–5 seconds for complex grammars and good scalability to large batches.

Abstract

Many LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime overhead for context-dependent token processing and is especially inefficient for large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to improve the efficiency of constrained LLM decoding. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into a DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
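To make the idea of a prefix-conditioned edge concrete, the following is a minimal sketch, not the implementation in the linked repository. It assumes an edge is keyed by the current state, the input symbol, and a required sequence of states on top of the stack; the table `EDGES`, the function `step`, and the state numbers are all hypothetical. Because the stack condition is part of the key, choosing the next transition is a dictionary lookup rather than a search over candidate paths.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical edge key: (current state, input symbol, required stack top).
EdgeKey = Tuple[int, str, Tuple[int, ...]]
# Hypothetical edge action: (next state, number of entries to pop, entries to push).
EdgeAction = Tuple[int, int, Tuple[int, ...]]

# Toy table: in state 3 the symbol "a" is context dependent. The stack
# condition stored on each edge decides which transition is taken.
EDGES: Dict[EdgeKey, EdgeAction] = {
    (3, "a", (1,)): (4, 1, (7,)),
    (3, "a", (2,)): (5, 1, (8,)),
}


def step(
    state: int,
    stack: List[int],
    symbol: str,
    edges: Dict[EdgeKey, EdgeAction],
    max_prefix: int = 2,
) -> Optional[Tuple[int, List[int]]]:
    """Take one transition, or return None if no edge matches.

    Conditions are tried from longest to shortest, so at most one edge
    is selected even when one condition is a suffix of another.
    """
    for k in range(max_prefix, -1, -1):
        if k > len(stack):
            continue
        condition = tuple(stack[len(stack) - k:])  # top k stack entries
        action = edges.get((state, symbol, condition))
        if action is not None:
            next_state, n_pop, push = action
            new_stack = stack[: len(stack) - n_pop] + list(push)
            return next_state, new_stack
    return None


print(step(3, [0, 1], "a", EDGES))  # (4, [0, 7])
print(step(3, [0, 2], "a", EDGES))  # (5, [0, 8])
print(step(3, [0, 9], "a", EDGES))  # None
```

In this toy table every edge is known before decoding starts, which is the property that allows edges to be analyzed, merged, or evaluated in parallel ahead of time.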

Paper Structure

This paper contains 38 sections, 4 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of Pre$^3$: The figure depicts the workflow from LR(1) grammar to DPDA-based generation, encompassing DPDA construction and optimization steps.
  • Figure 2: Prefix-conditioned edges. Top: before precomputation, 'a' is a context-dependent token whose transition requires runtime context. Bottom: after precomputation, each edge carries a stack-matching condition, so the condition and the transition symbol together uniquely determine the transition path.
  • Figure 3: Two edge types for DPDA computation: blue edges are acceptance edges (present in the original LR(1) graph, handling stack operations for acceptance); orange edges are reduction edges (added to the DPDA, matching and popping stack entries for reductions); gray edges depict LR(1) reduction paths, showing that a reduction passes through fewer nodes once the automaton is constructed.
  • Figure 4: (a) A pushdown automaton with a cycle through States 1, 2, 3, and 4 that can be traversed indefinitely, producing infinitely many possible paths and making transition paths indeterminable when reduction edges are added at State 5; (b) How our method handles the cycle: the back-edge from State 4 to State 1 is modified to check the stack for a complete cycle traversal (e.g., [1, 2, 3, 4]). If one is detected, it pops the redundant states (e.g., [1, 2, 3, 4]), so that reduction edges at State 5 only need to account for traversals without cycles.
  • Figure 5: Two different types of edge optimization.
  • ...and 2 more figures
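The cycle handling described in the Figure 4 caption can be illustrated with a toy sketch. This is one reading of the caption, not the paper's algorithm: it assumes each visited state is pushed onto the stack and that the back-edge pops a complete lap when it finds one on top of the stack. The names `take_back_edge` and `run_cycle` are hypothetical.

```python
from typing import List, Sequence


def take_back_edge(stack: List[int], cycle: Sequence[int]) -> List[int]:
    """Follow the back-edge: if the top of the stack holds one complete
    traversal of the cycle, pop it; otherwise leave the stack unchanged."""
    n = len(cycle)
    if n and stack[-n:] == list(cycle):
        return stack[:-n]
    return list(stack)


def run_cycle(stack: List[int], cycle: Sequence[int], laps: int) -> List[int]:
    """Traverse the cycle `laps` times, pushing each visited state and
    taking the back-edge at the end of every lap."""
    stack = list(stack)
    for _ in range(laps):
        stack.extend(cycle)
        stack = take_back_edge(stack, cycle)
    return stack


# The stack no longer grows with the number of laps.
print(run_cycle([0], (1, 2, 3, 4), laps=3))  # [0]
```

Under these assumptions the stack contents seen at the state after the cycle are the same for any number of laps, so only a finite set of stack conditions has to be enumerated when reduction edges are added there.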