Constrained Decoding with Speculative Lookaheads

Nishanth Nakshatri, Shamik Roy, Rajarshi Das, Suthee Chaidaroon, Leonid Boytsov, Rashmi Gangadharaiah

TL;DR

Constrained decoding of LLMs using lookahead heuristics (CDLH) achieves strong constraint satisfaction but at high computational cost. This work proposes constrained decoding with speculative lookaheads (CDSL), which drafts lookahead tokens with a small model, validates them with a larger target LLM, and scores them with a task-specific reward to decide acceptance, yielding runtime speedups of $2.2\times$ to $12.15\times$ over CDLH with minimal performance loss. Across two constrained decoding tasks (CommonGen and Harmless Text Generation) and three LLM families, CDSL is faster than both CDLH and the CDLH-appx baseline while maintaining competitive constraint satisfaction, demonstrating a practical balance between efficiency and alignment. The approach offers a flexible framework for deploying constraint-aware generation in real-world settings where inference speed is critical.

Abstract

Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations to human preferences. However, the extensive lookahead roll-out operations required for each generated token make CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding, which uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model generates lookaheads, which are then verified by a combination of the target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL on two constrained decoding tasks with three LLM families and achieve a 2.2x to 12.15x speedup over CDLH without significant performance reduction.
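
The draft, verify, and score loop described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the names `cdsl_step`, `approval_threshold`, and `reward_threshold`, the toy bigram "models", the concept-coverage reward, and the greedy fallback are all assumptions made for illustration. The actual acceptance rule and fallback are defined in the paper's algorithms, which this summary does not reproduce.

```python
from typing import Callable, Dict, List

Token = str
# A "model" here is any function mapping a token prefix to a next-token distribution.
Model = Callable[[List[Token]], Dict[Token, float]]


def make_bigram_model(table: Dict[Token, Dict[Token, float]]) -> Model:
    """Toy stand-in for an LLM: the next-token distribution depends only on the last token."""
    return lambda prefix: table[prefix[-1] if prefix else ""]


def draft_lookahead(draft: Model, prefix: List[Token], k: int) -> List[Token]:
    """Greedily roll out k lookahead tokens with the (cheap) draft model."""
    seq = list(prefix)
    for _ in range(k):
        dist = draft(seq)
        seq.append(max(dist, key=dist.get))
    return seq[len(prefix):]


def cdsl_step(draft: Model, target: Model, reward: Callable[[List[Token]], float],
              prefix: List[Token], k: int = 3,
              approval_threshold: float = 0.3,
              reward_threshold: float = 0.3) -> List[Token]:
    """One illustrative step: draft k tokens, verify with the target, gate on reward."""
    lookahead = draft_lookahead(draft, prefix, k)
    # Assumed verification rule: keep the longest run of drafted tokens
    # to which the target model assigns enough probability.
    approved: List[Token] = []
    for tok in lookahead:
        if target(prefix + approved).get(tok, 0.0) < approval_threshold:
            break
        approved.append(tok)
    # Assumed reward gate: commit the approved tokens only if the task reward is high enough.
    if approved and reward(prefix + approved) >= reward_threshold:
        return approved
    # Placeholder fallback (an assumption): emit a single greedy token from the target.
    dist = target(prefix)
    return [max(dist, key=dist.get)]


# Hypothetical toy setup: a CommonGen-like reward measuring concept coverage.
draft = make_bigram_model({
    "": {"the": 0.9, "dog": 0.1}, "the": {"dog": 0.8, "runs": 0.2},
    "dog": {"runs": 0.7, "<eos>": 0.3}, "runs": {"<eos>": 1.0}, "<eos>": {"<eos>": 1.0},
})
target = make_bigram_model({
    "": {"the": 0.6, "dog": 0.4}, "the": {"dog": 0.9, "runs": 0.1},
    "dog": {"runs": 0.5, "<eos>": 0.5}, "runs": {"<eos>": 1.0}, "<eos>": {"<eos>": 1.0},
})
concepts = {"dog", "runs"}
reward = lambda seq: len(concepts & set(seq)) / len(concepts)

accepted = cdsl_step(draft, target, reward, prefix=[])
strict = cdsl_step(draft, target, reward, prefix=[], approval_threshold=0.95)
print(accepted)  # ['the', 'dog', 'runs']
print(strict)    # ['the']
```

In this toy run, the default thresholds let all three drafted tokens through in a single step, whereas a very strict approval threshold rejects the draft and falls back to one target token. This is the intuition behind the speed and quality trade-off that the thresholds control.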

Paper Structure

This paper contains 34 sections, 11 figures, 8 tables, 2 algorithms.

Figures (11)

  • Figure 1: Inference speedup vs. performance on harmless text generation (Anthropic's HH-RLHF dataset). Apart from CDSL, we also propose a novel baseline (CDLH-appx), which uses the draft model to generate lookahead tokens for each beam. CDSL gains significant inference speedup over CDLH and CDLH-appx without the drastic performance reduction seen with other decoding algorithms such as greedy decoding. Plot best viewed in color.
  • Figure 2: Effect of different hyperparameters on runtime ((a), (b), (c)) and constraint satisfaction performance ((d), (e), (f)) in the CommonGen task, for the (target, draft) model pair (Bloomz-7.1B, Bloomz-1.7B). When held fixed, the approval threshold, reward threshold, and b value are set to $0.3$, $0.3$, and $0$, respectively.
  • Figure 3: Prompt template used for CommonGen task.
  • Figure 4: Prompt template used for harmless text generation task.
  • Figure 5: Prompt template used for scoring the generations using Llama-Guard-3-8B in the harmless text generation task.
  • ...and 6 more figures