Automata-based constraints for language model decoding

Terry Koo; Frederick Liu; Luheng He

TL;DR

The paper presents an automata-based framework for constraining language model decoding to formal languages, addressing the misalignment between subword tokens and the symbols of the formal grammar. It reformulates detokenization as a finite-state transducer (FST) and composes that FST with a finite-state automaton (FSA) or push-down automaton (PDA) to enforce regular or deterministic context-free constraints efficiently. The resulting system is modular and provably correct, and it compiles constraints roughly 7,000x faster than prior work (Willard and Louf, 2023). Applications include schema-driven JSON, Python dataclasses, and speculative decoding. This work provides a practical path to reliable constrained decoding at scale without bespoke tokenization tricks.

Abstract

Language models (LMs) are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors we would ideally constrain the LM to only produce valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls or schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor, and extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. Previous work on this topic (Willard and Louf, 2023) layers bespoke solutions onto automata, leading to problems with speed, correctness, and extensibility. Instead, we reformulate the entire task in terms of automata so we can leverage well-studied and well-optimized algorithms. Our system compiles constraints ~7,000x faster, is provably correct, and can be extended in a modular fashion.
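
As a rough illustration of the idea in the abstract, the sketch below lifts a character-level DFA to token level by walking each token's characters through the DFA. For a deterministic character automaton, this is what composing it with a detokenizing transducer amounts to. The regular expression /(foo)+d/ is taken from Figure 5; the vocabulary, the state numbering, and the names `CHAR_DFA`, `lift_to_tokens`, `allowed_tokens`, and `accepts` are assumptions made for illustration, not the authors' implementation.

```python
# Character-level DFA for /(foo)+d/ (state numbering is an assumption).
# 0 -f-> 1 -o-> 2 -o-> 3, 3 -f-> 1 (repeat), 3 -d-> 4 (final)
CHAR_DFA = {
    (0, "f"): 1,
    (1, "o"): 2,
    (2, "o"): 3,
    (3, "f"): 1,
    (3, "d"): 4,
}
START, FINALS = 0, {4}
STATES = range(5)

# Hypothetical subword vocabulary; real tokenizers are far larger.
VOCAB = ["f", "o", "d", "fo", "oo", "foo", "food", "bar"]


def walk(state, text):
    """Run the character DFA over `text`; return the end state, or None if stuck."""
    for ch in text:
        state = CHAR_DFA.get((state, ch))
        if state is None:
            return None
    return state


def lift_to_tokens(vocab):
    """Build token-level transitions: (state, token) -> next state."""
    table = {}
    for q in STATES:
        for tok in vocab:
            nxt = walk(q, tok)
            if nxt is not None:
                table[(q, tok)] = nxt
    return table


TOKEN_DFA = lift_to_tokens(VOCAB)


def allowed_tokens(state):
    """Tokens that can be emitted from `state` without leaving the automaton."""
    return sorted(tok for (q, tok) in TOKEN_DFA if q == state)


def accepts(tokens):
    """Check whether a token sequence detokenizes to a string in the language."""
    state = START
    for tok in tokens:
        state = TOKEN_DFA.get((state, tok))
        if state is None:
            return False
    return state in FINALS
```

At decoding time, `allowed_tokens(state)` would serve as the mask over the LM's next-token distribution. Both `["foo", "d"]` and `["f", "oo", "food"]` are accepted, so ambiguous tokenizations of the same text are handled uniformly. Every state in this small DFA can still reach a final state; a general version would also need to trim states that cannot.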

Paper Structure (48 sections, 7 theorems, 9 figures, 3 tables, 3 algorithms)

Introduction
Finite-state constraints
Finite-state automata (FSAs)
Finite-state transducers (FSTs)
Detokenization as transduction
Adapting regular expressions to tokens
Extensions
Wildcard matching
Syntactic sugar
Push-down constraints
Push-down automata (PDAs)
Adapting grammars to tokens
Related work
Automata for sequence models
Automata for general-purpose LMs
...and 33 more sections

Key Result

Lemma 1

Every valid transduction by ${\cal T}_V$ starts and ends at $q_r$, and traverses zero or more of the $|V|$ cycles, in any order.
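
One way to picture the structure that Lemma 1 describes is the sketch below, which builds a detokenizing transducer with a single root state and one cycle per token. The exact edge layout (the first edge of a cycle consumes the token, the remaining edges consume nothing and emit one character each) is an assumption made for illustration, as are the names `build_detokenizer` and `transduce` and the three-token vocabulary. Tokens are assumed to be non-empty and distinct.

```python
from collections import defaultdict

EPS = ""  # epsilon: the edge consumes no input
Q_R = 0   # the root state, written q_r in Lemma 1


def build_detokenizer(vocab):
    """Return edges[state] = list of (input_token, output_char, next_state)."""
    edges = defaultdict(list)
    next_state = 1
    for tok in vocab:
        state = Q_R
        for i, ch in enumerate(tok):
            # The first edge of the cycle consumes the token itself.
            inp = tok if i == 0 else EPS
            if i == len(tok) - 1:
                dst = Q_R  # the last character closes the cycle
            else:
                dst = next_state
                next_state += 1
            edges[state].append((inp, ch, dst))
            state = dst
    return edges


def transduce(edges, tokens):
    """Detokenize `tokens`; return the output string and the visited states."""
    state, out, path = Q_R, [], [Q_R]
    for tok in tokens:
        # Pick the cycle whose first edge consumes this token.
        matches = [e for e in edges[Q_R] if e[0] == tok]
        if not matches:
            raise ValueError(f"unknown token: {tok!r}")
        _, ch, state = matches[0]
        out.append(ch)
        path.append(state)
        # Follow the epsilon-input edges until the cycle returns to the root.
        while state != Q_R:
            _, ch, state = edges[state][0]
            out.append(ch)
            path.append(state)
    return "".join(out), path


EDGES = build_detokenizer(["f", "oo", "food"])
TEXT, PATH = transduce(EDGES, ["f", "oo", "food"])
```

In this sketch the root state has exactly one outgoing edge per token, and every path recorded by `transduce` starts and ends at the root, which matches the shape of the statement in Lemma 1.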

Figures (9)

Figure 1: FSAs that accept ab (left), odd numbers of a's (center), and runs of a's or b's (right). States are depicted as circles, with the start state in bold and final states doubled. Edges are depicted as directed arcs, labeled with the relevant input symbol, or $\epsilon$ if none.
Figure 2: The FSA constructed from the regular expression /a+|ab/ is initially non-deterministic (left), but can be determinized (right).
Figure 3: FSTs that transduce ab into x (left), odd numbers of a's into xoxo$\cdots$x (center), and runs of a's or b's into bracketed versions of themselves (right). Edge labels are $e^{\sigma}$:$e^{\delta}$.
Figure 4: A simple vocabulary of tokens (left), and a detokenizing FST that transduces sequences of those tokens into sequences of characters (right).
Figure 5: The character-based FSA equivalent to /(foo)+d/ (left) and its composition with the detokenizing FST from Figure 4 (right). Note that the same text can have many tokenizations (e.g., foo can be split into tokens in more than one way), and tokens are allowed to cross sub-expression boundaries (e.g., food merges the last repeat of /(foo)+/ with /d/).
...and 4 more figures
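
The determinization mentioned in the caption of Figure 2 can be sketched with the standard subset construction. The NFA below is one plausible encoding of /a+|ab/ and is an assumption, since the figure itself is not reproduced here; the names `determinize` and `dfa_accepts` are likewise illustrative. The NFA has no epsilon edges, so no epsilon-closure step is needed in this sketch.

```python
from collections import deque

# Assumed NFA for /a+|ab/: state 0 has two competing edges on "a".
# Branch a+: 0 -a-> 1 -a-> 1 (final). Branch ab: 0 -a-> 2 -b-> 3 (final).
NFA = {
    (0, "a"): {1, 2},
    (1, "a"): {1},
    (2, "b"): {3},
}
NFA_START, NFA_FINALS = 0, {1, 3}
ALPHABET = "ab"


def determinize(nfa, start, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa, dfa_finals = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        subset = queue.popleft()
        if subset & finals:
            dfa_finals.add(subset)
        for sym in alphabet:
            target = frozenset(t for q in subset for t in nfa.get((q, sym), ()))
            if not target:
                continue  # no transition on this symbol; omit the dead state
            dfa[(subset, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, start_set, dfa_finals


def dfa_accepts(dfa, start, finals, text):
    """Run the determinized automaton over `text`."""
    state = start
    for ch in text:
        state = dfa.get((state, ch))
        if state is None:
            return False
    return state in finals


DFA, DFA_START, DFA_FINALS = determinize(NFA, NFA_START, NFA_FINALS, ALPHABET)
```

After determinization, the two competing "a" edges out of the start state collapse into a single edge to the combined state {1, 2}, so each input character selects at most one next state.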

Theorems & Definitions (12)

Lemma 1
proof
Corollary 1
Corollary 2
Corollary 3
proof
Theorem 1
proof
Theorem 2
proof
...and 2 more
