
Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

Haotian Ye, Himanshu Jain, Chong You, Ananda Theertha Suresh, Haowei Lin, James Zou, Felix Yu

TL;DR

Constrained decoding forces LLM outputs to lie within a predefined set, but it can bias the output distribution and is inefficient when implemented with CPU-based data structures during GPU inference. The authors introduce Dynamic Importance Sampling for Constrained Decoding (DISC) with a GPU-friendly Parallel Prefix-Verification (PPV) primitive, yielding asymptotically unbiased constrained sampling and significant speedups. They prove theoretical bounds on the KL divergence and on the expected number of sampling steps, and report empirical gains across 20 datasets and four tasks, including up to 8.5x faster decoding and improved Micro F1 and R-Precision over trie-based methods. The approach is modular and applies to constraint types beyond simple set constraints, offering a practical path to reliable and scalable constrained generation in real-world applications.

Abstract

In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix trees. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our method to improve constrained generation in applications where adherence to specific constraints is essential.

Paper Structure

This paper contains 38 sections, 3 theorems, 24 equations, 3 figures, 6 tables, 1 algorithm.

Key Result

Theorem 2.1

For arbitrary probability value $p_b$, there exists an autoregressive model $L$ and a candidate set $\mathcal{S}$ with $p_b = 1 - P_L(\mathcal{S})$, such that the KL divergence …

Footnote: For two discrete distributions $P, Q$ over a set $\mathcal{X}$, the KL divergence is defined as $KL(P\|Q) = \sum_{x\in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$.
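To make the bias behind this theorem concrete, the following minimal Python sketch reproduces the arithmetic of Figure 1 on a toy model. It compares the distribution produced by step-wise masking and renormalization (prefix-tree constrained decoding) with the target distribution $P_L(\cdot \mid \mathcal{S})$. Several parts are assumptions for illustration and are not taken from the paper: the third candidate "used basketball", the next-token probabilities not shown in the figure, the helper names, the convention that a sequence ends once a candidate is complete, and the direction of the KL divergence (target relative to constrained).

```python
import math

# Toy next-token distributions P_L(token | prefix).
# Values not shown in Figure 1 (e.g. "basketball") are hypothetical fillers.
MODEL = {
    (): {"soccer": 0.6, "used": 0.4},
    ("soccer",): {"shoes": 0.9, "gloves": 0.1},
    ("used",): {"soccer": 0.9, "basketball": 0.1},
    ("used", "soccer"): {"shoes": 0.9, "gloves": 0.1},
}

# Candidate set S; "soccer shoes" and "used soccer gloves" are not available.
CANDIDATES = [
    ("soccer", "gloves"),
    ("used", "soccer", "shoes"),
    ("used", "basketball"),  # hypothetical third item
]

def model_prob(seq):
    """Unconstrained probability of a full sequence under the toy model."""
    p = 1.0
    for i, tok in enumerate(seq):
        p *= MODEL[seq[:i]][tok]
    return p

def trie_constrained_prob(seq, candidates):
    """Probability of seq when invalid tokens are masked and the rest renormalized at each step."""
    p = 1.0
    for i, tok in enumerate(seq):
        prefix = seq[:i]
        # Keep only tokens that extend the prefix toward some candidate.
        valid = {
            t: q for t, q in MODEL[prefix].items()
            if any(c[:i + 1] == prefix + (t,) for c in candidates)
        }
        p *= valid[tok] / sum(valid.values())
    return p

# Total model mass on the candidate set, P_L(S), and the mass outside it, p_b.
mass = sum(model_prob(c) for c in CANDIDATES)
p_b = 1.0 - mass

# Target distribution: the model's distribution conditioned on landing in S.
target = {c: model_prob(c) / mass for c in CANDIDATES}
# Distribution actually sampled by prefix-tree constrained decoding.
constrained = {c: trie_constrained_prob(c, CANDIDATES) for c in CANDIDATES}

# KL(target || constrained); the direction is an assumption of this sketch.
kl = sum(p * math.log(p / constrained[c]) for c, p in target.items())

for c in CANDIDATES:
    print(" ".join(c), round(target[c], 4), round(constrained[c], 4))
print("p_b =", round(p_b, 4), "KL =", round(kl, 4))
```

In this toy setting, "used soccer shoes" has the largest share of the target distribution, yet step-wise masking assigns most of the probability to "soccer gloves", which is the inversion described in the Figure 1 caption.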

Figures (3)

  • Figure 1: An illustration of how biased generation occurs when constrained decoding is applied. A user requests the model to recommend a product related to "soccer shoes". The probability of selecting each token is represented by the connection lines, where dashed lines with ✕ indicate invalid tokens (e.g., "soccer shoes" and "used soccer gloves" are not available in the recommendation list). Even though the probability of "soccer gloves" ($0.06$) is much lower than "used soccer shoes" ($0.324$), the model generates the former with a higher final probability ($0.6 \times 1$) than the latter ($0.4 \times 0.9 \times 1$), since the model shortsightedly selects "soccer", unaware of the invalidity of "soccer shoes".
  • Figure 2: An illustration of PPV. Assume the input is $[0,31,555]$, and the top-three token candidates given by the LLM are $\{35, 145,111\}$. PPV concatenates them to the input and forms a three-row matrix. Then, it compares each row with the prepared array $\mathcal{X}$ where each row is a keyword in $\mathcal{S}$, in parallel. Since $\mathcal{X}$ is ordered alphabetically, the comparison can be performed via a binary search. After the comparison, we can determine whether each of the three partial outputs is the prefix of a row in $\mathcal{X}$. PPV finishes by returning a vector mask indicating the validity of each candidate.
  • Figure 3: Time vs. performance comparison on the Document Retrieval and Entity Disambiguation tasks for all methods. The y-axis shows performance (R-Precision for Document Retrieval and Micro F1 for Entity Disambiguation), and the x-axis shows inference time averaged across all datasets belonging to that task. The improvement in time (x-axis) results from PPV, and the improvement in performance (y-axis) results from DISC.
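
The prefix check described in the Figure 2 caption can be sketched in plain Python as follows. This is a sequential, CPU-only illustration using the standard-library `bisect` module, not the authors' GPU implementation: the function name `ppv_mask`, the use of variable-length tuples instead of a padded array, and the rows of the example array are all assumptions. Only the input `[0, 31, 555]` and the candidates `35, 145, 111` come from the caption.

```python
from bisect import bisect_left

def ppv_mask(prefix, candidates, sorted_rows):
    """Return one boolean per candidate token: True if prefix + [token]
    is a prefix of at least one row in sorted_rows.

    sorted_rows must be a lexicographically sorted list of token-id tuples.
    """
    mask = []
    for tok in candidates:
        extended = tuple(prefix) + (tok,)
        # Rows that start with `extended` are contiguous in sorted order and
        # all compare >= `extended`, so checking the first such row suffices.
        i = bisect_left(sorted_rows, extended)
        ok = i < len(sorted_rows) and sorted_rows[i][:len(extended)] == extended
        mask.append(ok)
    return mask

# Hypothetical candidate set, stored as sorted token-id sequences.
rows = sorted([
    (0, 31, 555, 35, 2),
    (0, 31, 555, 111),
    (0, 31, 600, 145),
    (0, 40, 7),
])

mask = ppv_mask([0, 31, 555], [35, 145, 111], rows)
print(mask)  # → [True, False, True]
```

Each candidate is checked independently of the others, which is what makes the per-candidate loop a natural target for batched, parallel execution on an accelerator.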

Theorems & Definitions (3)

  • Theorem 2.1
  • Theorem 3.1
  • Lemma 7.1