XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen

TL;DR

XGrammar tackles the overhead of constrained generation for large language models by using a byte-level pushdown automaton to represent context-free grammars and by dividing tokens into context-independent and context-dependent classes. It introduces an adaptive token mask cache, a persistent execution stack, and context expansion to drastically reduce runtime checks, while overlapping grammar computation with GPU inference to minimize overhead. The system delivers up to 100x per-token speedups and up to 80x end-to-end serving improvements, with strong gains in syntactic correctness for structured outputs. The work is open-sourced and designed for integration with major LLM frameworks, enabling scalable, structure-aware generation across diverse applications.

Abstract

The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires checking several stack states against every token in the vocabulary at runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations that expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it enables near-zero-overhead structured generation in end-to-end LLM serving.
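The split between prechecked and runtime-checked tokens can be illustrated with a toy sketch. This is not XGrammar's implementation or API: the five-token vocabulary, the state name `after_element`, and the helpers `CacheEntry`, `runtime_check`, and `build_mask` are all hypothetical, and the runtime check is hand-written for a nested-array grammar rather than derived from a pushdown automaton. The point is only that most tokens are decided from a per-state cache, while a token such as `]]` has to be checked against the full stack.

```python
from dataclasses import dataclass

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = [",", "]", "]]", "x", '"s"']


@dataclass(frozen=True)
class CacheEntry:
    accepted: frozenset           # context-independent: always valid in this state
    context_dependent: frozenset  # validity depends on the rest of the stack
    # Every other token is context-independent and rejected.


# Built once during preprocessing, keyed by the state at the top of the stack.
CACHE = {
    "after_element": CacheEntry(
        accepted=frozenset({",", "]"}),
        context_dependent=frozenset({"]]"}),
    )
}


def runtime_check(token, stack):
    """Toy stand-in for interpreting a token against the whole stack."""
    if token == "]]":
        # "]]" closes two arrays, so at least two must currently be open.
        return stack.count("array") >= 2
    return False


def build_mask(state, stack):
    """Combine cached decisions with runtime checks into a full token mask."""
    entry = CACHE[state]
    mask = []
    for tok in VOCAB:
        if tok in entry.accepted:
            mask.append(True)
        elif tok in entry.context_dependent:
            mask.append(runtime_check(tok, stack))
        else:
            mask.append(False)
    return mask


shallow = build_mask("after_element", ["array"])
nested = build_mask("after_element", ["array", "array"])
print(shallow)
print(nested)
```

In this sketch only one of the five tokens reaches `runtime_check`; the other four are resolved by set lookups, which is the effect the precheck is meant to achieve.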

Paper Structure

This paper contains 24 sections, 9 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: Overview of our approach. XGrammar first uses a pushdown automaton to parse the prior LLM output, flexibly supporting diverse grammars and producing the matching stack states. It then uses the stack top to index into the adaptive token mask cache, our key optimization, to retrieve a partial mask. Most of the partial mask consists of context-independent tokens and is determined during preprocessing. A small portion, however, is context-dependent and resolved at runtime. This yields the complete token mask, thus enabling efficient constrained decoding.
  • Figure 2: Constrained decoding with a per-token mask. The per-token mask prevents the LLM from generating tokens that would be invalid according to the structure at that step.
  • Figure 3: Top: A context-free grammar for arrays and strings that can be recursively composed. This CFG is converted into the pushdown automaton in Figure 1. denotes every character except and . Bottom: Two possible matching stacks for matching the string to the CFG. Each stack represents a possible expansion of the rules in the CFG. The edges and nodes in the stack correspond to the transitions and states in the PDA in Figure 1.
  • Figure 4: An example for the token mask cache. Tokens are categorized into three types: context-independent (accepted), context-independent (rejected), and context-dependent. The first two types can be directly determined for mask generation at runtime.
  • Figure 5: The adaptive storage format. In accept-heavy cases, we store the rejected tokens and context-dependent tokens. In reject-heavy cases, we store the accepted tokens and context-dependent tokens. In rare cases where the accepted and rejected tokens are equal in number, we compress them into a bitset of the vocabulary size.
  • ...and 7 more figures
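
The adaptive storage format in the Figure 5 caption can be sketched as follows. This is a minimal illustration under assumptions, not the paper's data layout: `compress` and `accepted_ids` are hypothetical helpers, tokens are plain integer ids, the cache entry is a dictionary, and a Python integer stands in for the vocabulary-sized bitset.

```python
def compress(accepted, rejected, context_dependent):
    """Store whichever context-independent set is smaller (hypothetical layout)."""
    accepted, rejected = set(accepted), set(rejected)
    cd = sorted(context_dependent)
    if len(accepted) > len(rejected):
        # Accept-heavy: keep only the rejected ids.
        return {"kind": "accept_heavy", "rejected": sorted(rejected),
                "context_dependent": cd}
    if len(rejected) > len(accepted):
        # Reject-heavy: keep only the accepted ids.
        return {"kind": "reject_heavy", "accepted": sorted(accepted),
                "context_dependent": cd}
    # Equal counts: one bit per token id, set when the token is accepted.
    bits = 0
    for token_id in accepted:
        bits |= 1 << token_id
    return {"kind": "bitset", "bits": bits, "context_dependent": cd}


def accepted_ids(entry, vocab_size):
    """Recover the context-independent accepted ids from a stored entry."""
    cd = set(entry["context_dependent"])
    if entry["kind"] == "accept_heavy":
        rejected = set(entry["rejected"])
        return [t for t in range(vocab_size)
                if t not in rejected and t not in cd]
    if entry["kind"] == "reject_heavy":
        return list(entry["accepted"])
    return [t for t in range(vocab_size) if (entry["bits"] >> t) & 1]


VOCAB_SIZE = 8  # toy size for illustration
accept_heavy = compress({0, 1, 2, 3, 4, 5}, {6}, {7})
reject_heavy = compress({3}, {0, 1, 2, 4, 5, 6}, {7})
balanced = compress({0, 1, 2}, {3, 4, 5}, {6, 7})
print(accept_heavy["kind"], accepted_ids(accept_heavy, VOCAB_SIZE))
print(reject_heavy["kind"], accepted_ids(reject_heavy, VOCAB_SIZE))
print(balanced["kind"], accepted_ids(balanced, VOCAB_SIZE))
```

Storing the smaller of the two sets keeps each entry proportional to the minority class, and the context-dependent ids are kept in every case because they still need a runtime check.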