Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation

Luca Beurer-Kellner, Marc Fischer, Martin Vechev

TL;DR

This work tackles the challenge of enforcing formal constraints in LLM generation without sacrificing accuracy or throughput. It introduces DOMINO, a subword-aligned constrained decoding algorithm that uses offline precomputation, a vocabulary-aligned subterminal tree, and online parsing to maintain expressive constraints with minimal overhead. The method is augmented with speculative decoding and opportunistic masking to further accelerate inference. Empirical results on GSM8K and CoNLL-2003 across Mistral 7B and Llama-2 13B show that DOMINO delivers equal or better task accuracy than unconstrained generation while achieving substantial throughput gains, often surpassing unconstrained speeds. The findings suggest constrained generation can be effectively fused with LLMs to produce reliably structured outputs at high speed, enabling scalable, format-safe AI systems.

Abstract

To ensure that text generated by large language models (LLMs) is in an expected format, constrained decoding proposes to enforce strict formal language constraints during generation. However, as we show in this work, not only do such methods incur performance overhead during generation, but many of them also significantly impair task accuracy, if they do not correctly align the underlying LLM sub-word vocabularies with external constraints. To address this, we present a novel decoding algorithm, DOMINO, that can enforce constraints in a fully subword-aligned fashion, while leveraging pre-computation and speculative decoding to achieve virtually no overhead and in some cases even almost 2$\times$ speedup over unconstrained decoding -- thereby outperforming existing approaches by a wide margin.
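
To make the notion of subword-aligned masking concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy vocabulary and a hand-written character-level automaton for a hypothetical integer terminal (`0` or a non-zero digit followed by digits); the names `step`, `run`, `token_mask`, and `VOCAB` are illustrative only. The point it shows is that a token is allowed whenever its whole character sequence is a legal continuation, even if the token spans several characters.

```python
# Toy character-level DFA for a hypothetical integer terminal: 0 | [1-9][0-9]*
# States are "start", "zero", and "digits". All names here are illustrative
# assumptions, not identifiers from the paper.

def step(state, ch):
    """Return the next DFA state, or None if `ch` is rejected in `state`."""
    if state == "start":
        if ch == "0":
            return "zero"
        if ch in "123456789":
            return "digits"
        return None
    if state == "digits" and ch in "0123456789":
        return "digits"
    return None  # "zero" has no outgoing transitions


def run(state, text):
    """Feed every character of `text` to the DFA; None means some character was rejected."""
    for ch in text:
        state = step(state, ch)
        if state is None:
            return None
    return state


def token_mask(state, vocab):
    """One boolean per token: True if the full token is a legal continuation from `state`."""
    return [run(state, token) is not None for token in vocab]


VOCAB = ["1", "12", "0", "07", "a", "3a", "999"]
mask_start = token_mask("start", VOCAB)    # nothing generated yet
mask_digits = token_mask("digits", VOCAB)  # after a non-zero leading digit
```

In a real decoder, such a mask would be applied to the model's logits before sampling; scanning the whole vocabulary at every step, as this sketch does, is the kind of cost that precomputation is meant to avoid.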

Paper Structure

This paper contains 42 sections, 1 theorem, 1 equation, 5 figures, 4 tables, 3 algorithms.

Key Result

Lemma 3.1

Let $L_{G}$ be the language described by a CFG $G$. Further, let $r_1, \ldots, r_n$ be the regular expressions of the terminals of $G$, and let $r_{\textsc{EOS}} = \$$. Then, it holds that:

Figures (5)

  • Figure 1: Greedy (overly-invasive) constraining of LLMs can distort tokenization, leading to different output than with unconstrained decoding, even in the case where unconstrained generation would produce valid output for the same prompt. Gray boxes represent vocabulary tokens, orange hue is proportional to perplexity.
  • Figure 2: Template-based tokens (marked in the figure) force unnatural tokenization and formatting, which can lead to different outputs and increased perplexity. Gray boxes represent vocabulary tokens, hue is proportional to perplexity.
  • Figure 3: Running example and overview of Domino. (a) shows an example grammar, (b) the character-level NFA for this language, (d) one of the per-state subterminal trees for the grammar in (c). (e) shows how a parser can be used to prune this tree at inference time and obtain token masks efficiently by traversing the tree.
  • Figure 4: NFA for the int terminal from Figure 3 (a). Traversed from node int, this NFA accepts all legal inputs for the terminal.
  • Figure 5: Impact of the number of speculative tokens $k$ on throughput (tokens per second) with Mistral 7B and JSON generation with and without schema, using Domino with LLMs.
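
The caption of Figure 3 mentions obtaining token masks by traversing a pruned tree. As a rough illustration of that idea, the sketch below uses a plain character trie over a toy vocabulary, which is a simpler stand-in and not the paper's subterminal tree. The automaton `digit_step`, the vocabulary, and all function names are assumptions made for this example. Tokens that share a prefix are checked once, and a whole subtree is skipped as soon as one character is rejected.

```python
def build_trie(vocab):
    """Build a character trie; each node stores the ids of tokens that end at it."""
    root = {"children": {}, "token_ids": []}
    for token_id, token in enumerate(vocab):
        node = root
        for ch in token:
            node = node["children"].setdefault(ch, {"children": {}, "token_ids": []})
        node["token_ids"].append(token_id)
    return root


def digit_step(state, ch):
    """Hypothetical one-state automaton that accepts any run of ASCII digits."""
    return state if ch in "0123456789" else None


def allowed_token_ids(trie, state, step):
    """Walk trie and automaton together; prune a subtree when a character is rejected."""
    allowed = []
    stack = [(trie, state)]
    while stack:
        node, current = stack.pop()
        allowed.extend(node["token_ids"])  # every token ending here is legal
        for ch, child in node["children"].items():
            nxt = step(current, ch)
            if nxt is not None:  # otherwise the whole subtree below `child` is skipped
                stack.append((child, nxt))
    return sorted(allowed)


VOCAB = ["1", "12", "1a", "a1", "123", "b"]
TRIE = build_trie(VOCAB)
ALLOWED = allowed_token_ids(TRIE, "digits", digit_step)
```

Here `ALLOWED` holds the ids of `"1"`, `"12"`, and `"123"`; the trie is built once offline, so only the traversal runs at inference time.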

Theorems & Definitions (2)

  • Definition 2.1: Minimally invasive
  • Lemma 3.1