Flexible and Efficient Grammar-Constrained Decoding

Kanghee Park, Timothy Zhou, Loris D'Antoni

TL;DR

To ensure that LLM outputs adhere to context-free grammar (CFG) constraints, this paper introduces GreatGramma, a grammar-constrained decoding (GCD) framework that splits work into offline preprocessing and online masking. The core idea is a token-to-terminal transducer built from a lexer automaton and the LLM vocabulary; it enables precomputation of realizable terminal sequences and of an inverse token-spanner table, which is used to mask invalid tokens quickly during decoding. Experiments show that GreatGramma achieves 17.71x faster offline preprocessing than SynCode and the lowest online masking latency among the baselines, while remaining robust. The work also highlights soundness bugs in prior GCD implementations and emphasizes a simple, modular implementation of about 900 lines of Python. This approach enables flexible, scalable GCD for dynamic grammars in domains such as program synthesis and grammar prompting.

Abstract

Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
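
As a rough illustration of what "masking out tokens" means, the sketch below applies a naive mask for a toy balanced-parentheses language. It is not the GreatGramma algorithm: the vocabulary, the `is_viable_prefix` check, and the `mask_logits` helper are all hypothetical, and the mask is recomputed by brute force over the whole vocabulary at every step, which is the kind of online work that offline preprocessing is meant to avoid.

```python
import math

# Hypothetical toy vocabulary; real LLM vocabularies hold tens of thousands of subword tokens.
VOCAB = ["(", ")", "()", "((", "))", "x"]


def is_viable_prefix(text: str) -> bool:
    """Return True if `text` can still be extended to a balanced-parentheses string."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closed more than was opened: no extension can repair this
                return False
        else:  # any other character is outside the toy language
            return False
    return True


def mask_logits(prefix: str, logits: list[float]) -> list[float]:
    """Set the logit of every token that provably leaves the language to -inf."""
    return [
        logit if is_viable_prefix(prefix + token) else -math.inf
        for token, logit in zip(VOCAB, logits)
    ]


# After the prefix "(", the tokens "))" and "x" are masked; the other four remain.
masked = mask_logits("(", [0.0] * len(VOCAB))
```

Note that the multi-character tokens `()`, `((`, and `))` each span several grammar-level symbols, which is the tokenizer/grammar misalignment the abstract refers to.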

Paper Structure

This paper contains 34 sections, 10 theorems, 4 equations, 4 figures, 8 tables, 6 algorithms.

Key Result

Proposition 3.5

If a PDA $\mathcal{P}$ accepts an input sequence $w$ in state $q$ with stack configuration $\gamma$, then $w$ is also accepted in the same state $q$ when the stack configuration is $\gamma' \cdot \gamma$ for some $\gamma'$ (i.e., when $\gamma$ appears at the top of the stack with additional symbols

Figures (4)

  • Figure 1: Illustrative example of the approach implemented in GreatGramma.
  • Figure 2: A lexing transducer $\mathcal{T}_\mathcal{A}$ derived from FSA $\mathcal{A}$ in Figure 1.
  • Figure 3: Detokenizing transducer for vocabulary $\mathcal{V}=\{\texttt{a}, \texttt{b}, \texttt{c}, \texttt{ab}, \texttt{ac}, \texttt{aba}\}$.
  • Figure 4: A determinized token-level lexing transducer $\mathcal{T}_{\mathcal{A} \circ \mathcal{V}}$, which is formed by composing $\mathcal{T}_\mathcal{V}$ from Figure 3 and $\mathcal{T}_\mathcal{A}$ from Figure 2.
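
The composition named in Figure 4 can be pictured, very loosely, as lifting a character-level automaton to the token level by running each token's characters through it. The sketch below does this for the vocabulary from Figure 3. The character-level DFA is a hypothetical one (it accepts `(ab)*c`) and is not the automaton from the paper's figures, and the sketch tracks only state changes, omitting the terminal outputs and lookahead that the paper's transducers carry.

```python
# Vocabulary from Figure 3.
VOCAB = ["a", "b", "c", "ab", "ac", "aba"]

# Hypothetical character-level DFA, assumed for illustration only.
# It accepts (ab)*c: state 0 is the start state, state 2 is accepting.
CHAR_DFA = {
    (0, "a"): 1,
    (1, "b"): 0,
    (0, "c"): 2,
}
STATES = [0, 1, 2]


def step_token(state, token):
    """Run every character of `token` through the DFA; return None if it gets stuck."""
    for ch in token:
        state = CHAR_DFA.get((state, ch))
        if state is None:
            return None
    return state


def build_token_table(states, vocab):
    """Precompute, once and offline, the token-level transition for each (state, token)."""
    table = {}
    for q in states:
        for token in vocab:
            nxt = step_token(q, token)
            if nxt is not None:
                table[(q, token)] = nxt
    return table


TOKEN_TABLE = build_token_table(STATES, VOCAB)
# From state 0, the token "aba" lands in state 1, while "ac" has no transition.
```

With such a table in hand, checking whether a token is admissible in a given lexer state becomes a dictionary lookup rather than a character-by-character scan.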

Theorems & Definitions (20)

  • Definition 3.1: 1-lookahead
  • Definition 3.2: Producible Terminals
  • Definition 3.3: Realizable Terminal Sequences
  • Definition 3.4: Inverse Token Spanner Table
  • Proposition 3.5: Stack Invariance
  • Proposition 3.6: Overapproximation via FSA
  • Lemma 3.1
  • proof
  • Theorem 3.2
  • Proposition 3.3
  • ...and 10 more