SynCode: LLM Generation with Grammar Augmentation

Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, Gagandeep Singh

TL;DR

SynCode addresses the challenge of producing syntactically correct outputs from large language models by enforcing grammar constraints through an offline DFA mask store. It formalizes a two-step decoding framework that parses partial outputs to derive accept sequences and a remainder, then uses a DFA walk to assemble a mask over the vocabulary, constraining the next token to syntactically valid choices. The approach provides soundness guarantees and, under longer lookahead, completeness, and demonstrates major reductions in syntax errors across JSON, SQL, Python, and Go, along with favorable runtime characteristics due to offline preprocessing and GPU parallelism. The results suggest CFG-guided generation can be practical for real-world AI pipelines requiring strict formal-language fidelity, with modest offline costs and broad language support.
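
The mask-lookup step described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not SynCode's actual code or API: a single terminal `INT = [0-9]+` stands in for the grammar, the six-entry `VOCAB` stands in for an LLM vocabulary, and the names `walk`, `MASK_STORE`, and `next_token` are hypothetical. The parser and the accept-sequence lookahead are omitted entirely, so this shows only how a table precomputed offline turns a DFA state into a vocabulary mask at decoding time.

```python
from typing import Dict, List, Optional, Tuple

# Toy DFA for the single terminal INT = [0-9]+ (an assumption for illustration).
# State 0 is the start state; state 1 is accepting.
DIGITS = "0123456789"
TRANSITIONS: Dict[Tuple[int, str], int] = {}
for d in DIGITS:
    TRANSITIONS[(0, d)] = 1
    TRANSITIONS[(1, d)] = 1

# Live states: states from which an accepting state is still reachable.
LIVE_STATES = {0, 1}

# Hypothetical stand-in for an LLM vocabulary.
VOCAB: List[str] = ["1", "23", "4a", "abc", "007", " "]


def walk(state: int, text: str) -> Optional[int]:
    """Run the DFA over text; return the end state, or None if it gets stuck."""
    for ch in text:
        nxt = TRANSITIONS.get((state, ch))
        if nxt is None:
            return None
        state = nxt
    return state


# Offline step: for every live DFA state, record which vocabulary tokens
# keep the DFA in a live state. This table plays the role of a mask store.
MASK_STORE: Dict[int, List[bool]] = {
    q: [walk(q, tok) in LIVE_STATES for tok in VOCAB] for q in LIVE_STATES
}


def next_token(remainder: str, scores: List[float]) -> str:
    """Online step: pick the best-scoring token that the mask allows."""
    state = walk(0, remainder)
    if state is None:
        raise ValueError("remainder is not a prefix of the terminal")
    mask = MASK_STORE[state]
    allowed = [(s, tok) for s, tok, ok in zip(scores, VOCAB, mask) if ok]
    return max(allowed, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Made-up scores; the two highest-scoring tokens are masked out.
    print(next_token("12", [0.1, 0.2, 0.9, 0.8, 0.3, 0.5]))
```

In this sketch the two highest-scoring tokens (`"4a"` and `"abc"`) are rejected because they would drive the DFA into a dead state, so the best remaining token is chosen. The per-token DFA walks happen once, when `MASK_STORE` is built; at decoding time only the short remainder is walked and the mask is read from the table.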

Abstract

LLMs are widely used in complex AI applications. These applications underscore the need for LLM outputs to adhere to a specific format so that they can be integrated with other components of the system. Typically, the format rules, e.g., for data serialization formats such as JSON and YAML or for code in a programming language, are expressed as a context-free grammar (CFG). Due to the hallucinations and unreliability of LLMs, instructing LLMs to adhere to a specified syntax becomes an increasingly important challenge. We present SynCode, a novel framework for efficient and general syntactical decoding with LLMs, to address this challenge. SynCode ensures soundness and completeness with respect to the CFG of a formal language, effectively retaining valid tokens while filtering out invalid ones. For efficient generation, SynCode uses an offline-constructed lookup table, the DFA mask store, derived from the DFA of the language's grammar. SynCode integrates with any language defined by a CFG, as evidenced by experiments on generating JSON, Python, and Go outputs. Our experiments on JSON generation demonstrate that SynCode eliminates all syntax errors and significantly outperforms state-of-the-art baselines. Furthermore, our results show that SynCode reduces syntax errors in generated Python and Go code by 96.07%, a substantial improvement in the syntactical precision of LLM generation. Our code is available at https://github.com/uiuc-focal-lab/syncode

Paper Structure

This paper contains 37 sections, 10 theorems, 13 equations, 15 figures, 7 tables, 4 algorithms.

Key Result

Lemma 1

Given $\Lambda = \{\tau_{f+1}, \tau_{f+2}, \dots, \tau_{f+d}\}$, $\Lambda^p = \{\tau_{f+2}, \dots, \tau_{f+d}\}$, and $\rho_\Lambda = (\rho_{f+1}, \rho_{f+2}, \ldots, \rho_{f+d})$, $\textit{dmatch}(w, q_0^{\tau_{f+1}}, \Lambda^p) \iff \textit{pmatch}(w, \rho_\Lambda)$.

Figures (15)

  • Figure 1: In the SynCode workflow, the LLM takes partial output $C_k$ and generates a distribution for the next token $t_{k+1}$. The parser processes $C_k$ to produce accept sequences $\mathcal{A}$ and remainder $r$. These values are used by the DFA mask store to create a token mask, eliminating syntactically invalid tokens. The LLM iteratively generates a token $t_{k+1}$ using the distribution and the mask, appending it to $C_k$ to create the updated code $C_{k+1}$. The process continues until the LLM returns the final code $C_n$ based on the defined stop condition.
  • Figure 2: Tokenization of a string.
  • Figure 3: Example grammar for illustration.
  • Figure 4: Prompt for the example which is provided as input to the LLM.
  • Figure 5: Output from LLM without and with SynCode. The colors represent the tokenization of the output.
  • ...and 10 more figures

Theorems & Definitions (27)

  • Definition 1: DFA
  • Definition 2: Lexer
  • Definition 3: Partial Outputs
  • Definition 4: Syntactical Decoding
  • Definition 5: Partial Parse
  • Definition 6: Partial Sentences
  • Definition 7: Accept Sequence
  • Definition 8: pmatch
  • Definition 9: DFA $\textit{live}$ states
  • Definition 10: dmatch
  • ...and 17 more