Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation
Luca Beurer-Kellner, Marc Fischer, Martin Vechev
TL;DR
This work tackles the challenge of enforcing formal constraints during LLM generation without sacrificing accuracy or throughput. It introduces DOMINO, a subword-aligned constrained decoding algorithm that combines offline precomputation, a vocabulary-aligned subterminal tree, and online parsing to enforce expressive constraints with minimal overhead. The method is further accelerated with speculative decoding and opportunistic masking. Empirical results on GSM8K and CoNLL-2003 with Mistral 7B and Llama-2 13B show that DOMINO matches or exceeds the task accuracy of unconstrained generation while achieving substantial throughput gains, in some cases exceeding the speed of unconstrained decoding. The findings suggest that constrained generation can be integrated with LLMs to produce reliably structured outputs at high speed, enabling scalable, format-safe AI systems.
Abstract
To ensure that text generated by large language models (LLMs) is in an expected format, constrained decoding proposes to enforce strict formal language constraints during generation. However, as we show in this work, not only do such methods incur performance overhead during generation, but many of them also significantly impair task accuracy if they do not correctly align the underlying LLM sub-word vocabularies with external constraints. To address this, we present a novel decoding algorithm, DOMINO, that can enforce constraints in a fully subword-aligned fashion, while leveraging pre-computation and speculative decoding to achieve virtually no overhead and, in some cases, a speedup of almost 2$\times$ over unconstrained decoding, thereby outperforming existing approaches by a wide margin.
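To make the alignment issue concrete, the following toy sketch contrasts two ways of computing the set of allowed next tokens under a constraint. It is not the DOMINO algorithm and does not reproduce any existing library; the vocabulary, the constraint (outputs of the form `{"a":<digits>}`), and the names `is_valid_prefix`, `aligned_allowed`, and `naive_allowed` are all assumptions made for illustration. The naive mask stands in for one way misalignment can arise: it only admits tokens that stay inside a single grammar terminal, so tokens that bridge two terminals (such as `{"` or `2}`) are excluded even though they lead to valid outputs.

```python
# Toy vocabulary (hypothetical): note the "bridge" tokens '{"', '":' and '2}',
# each of which spans more than one grammar terminal.
VOCAB = ["{", '{"', '"', "a", 'a"', '":', ":", "1", "2", "12", "2}", "}", " the"]

# Fixed terminals that open every valid output: { "a" :
FIXED = ["{", '"a"', ":"]
HEAD = "".join(FIXED)  # '{"a":'


def is_valid_prefix(s: str) -> bool:
    """True if s can be extended to (or already is) a string {"a":<digits>}."""
    if len(s) <= len(HEAD):
        return HEAD.startswith(s)
    if not s.startswith(HEAD):
        return False
    rest = s[len(HEAD):]
    body = rest[:-1] if rest.endswith("}") else rest
    return body.isdigit()  # at least one digit, nothing after the closing brace


def aligned_allowed(prefix: str) -> list[str]:
    """Subword-aligned mask: keep every token that preserves a valid prefix."""
    return [t for t in VOCAB if is_valid_prefix(prefix + t)]


def naive_allowed(prefix: str) -> list[str]:
    """Terminal-aligned mask: a token may not cross a terminal boundary."""
    pos = 0
    for term in FIXED:
        end = pos + len(term)
        if len(prefix) < end:
            remaining = term[len(prefix) - pos:]
            return [t for t in VOCAB if remaining.startswith(t)]
        pos = end
    number = prefix[pos:]
    if number.endswith("}"):
        return []  # output already complete
    allowed = [t for t in VOCAB if t.isdigit()]
    if number:
        allowed.append("}")  # closing brace only after at least one digit
    return allowed


for prefix in ["", '{"a', '{"a":1']:
    print(repr(prefix), aligned_allowed(prefix), naive_allowed(prefix))
```

In this sketch both masks only ever permit valid outputs, but the naive one removes tokenizations the model might prefer, which is the kind of distortion a subword-aligned method is meant to avoid. Checking every vocabulary entry at every step, as `aligned_allowed` does here, is deliberately simplistic and would be slow at realistic vocabulary sizes.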
