Constrained Decoding with Speculative Lookaheads
Nishanth Nakshatri, Shamik Roy, Rajarshi Das, Suthee Chaidaroon, Leonid Boytsov, Rashmi Gangadharaiah
TL;DR
Constrained decoding of LLMs using lookahead heuristics (CDLH) achieves strong constraint satisfaction but at high computational cost. This work proposes constrained decoding with speculative lookaheads (CDSL), which drafts lookahead tokens with a small model, validates them with a larger target LLM, and scores them via a task-specific reward to decide acceptance, yielding substantial runtime speedups ($2.2\times$ to $12.15\times$) with minimal performance loss. Across two constraint tasks (CommonGen and Harmless Text Generation) and three LLM families, CDSL outperforms CDLH and the CDLH-Appx baseline in speed while maintaining competitive constraint satisfaction, demonstrating a practical balance between efficiency and alignment. The approach offers a flexible framework for deploying constraint-aware generation in real-world settings where inference speed is critical.
Abstract
Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations to human preferences. However, the extensive lookahead roll-out operations required for each generated token make CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding, which uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model generates lookaheads, which are then verified by a combination of the target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL on two constrained decoding tasks with three LLM families and achieve a 2.2x to 12.15x speedup over CDLH without significant performance reduction.
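To make the draft, verify, and score loop described above concrete, here is a minimal Python sketch of a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the acceptance rule, so the prefix-wise verification, the `min_reward` threshold, and the one-token fallback are all hypothetical choices. The function `speculative_lookahead_step` and the `toy_*` helpers are invented for this example, and the toy models are simple stand-ins for real draft and target LLMs.

```python
from typing import Callable, List, Sequence


def speculative_lookahead_step(
    prefix: Sequence[str],
    draft: Callable[[Sequence[str], int], List[str]],
    target_accepts: Callable[[Sequence[str], str], bool],
    reward: Callable[[Sequence[str]], float],
    fallback: Callable[[Sequence[str]], str],
    k: int = 3,
    min_reward: float = 0.5,
) -> List[str]:
    """Return the tokens to append to `prefix` for one decoding step.

    Assumed acceptance rule (not taken from the paper): keep the longest
    run of drafted tokens that the target accepts, then commit that run
    only if the task reward of the extended sequence reaches `min_reward`.
    """
    lookahead = draft(prefix, k)  # cheap proposal from the small model

    accepted: List[str] = []
    for token in lookahead:
        # Target-side check; stop at the first rejected token.
        if not target_accepts(list(prefix) + accepted, token):
            break
        accepted.append(token)

    # Task-specific reward decides whether the verified run is kept.
    if accepted and reward(list(prefix) + accepted) >= min_reward:
        return accepted

    # Otherwise fall back to a single token from the target model.
    return [fallback(prefix)]


# --- Toy stand-ins (hypothetical; real systems would call LLMs) ---
CONCEPTS = {"dog", "park"}  # CommonGen-style required concepts
DRAFT_TEXT = ["the", "dog", "runs", "in", "the", "park"]


def toy_draft(prefix: Sequence[str], k: int) -> List[str]:
    # Pretend the draft model continues a fixed sentence.
    start = len(prefix)
    return DRAFT_TEXT[start:start + k]


def toy_target_accepts(context: Sequence[str], token: str) -> bool:
    # Pretend the target model rejects only the token "cat".
    return token != "cat"


def toy_reward(tokens: Sequence[str]) -> float:
    # Fraction of required concepts that appear so far.
    return len(CONCEPTS & set(tokens)) / len(CONCEPTS)


def toy_fallback(prefix: Sequence[str]) -> str:
    # Pretend the target model's own next token is "a".
    return "a"


if __name__ == "__main__":
    step = speculative_lookahead_step(
        [], toy_draft, toy_target_accepts, toy_reward, toy_fallback
    )
    print(step)  # ['the', 'dog', 'runs']
```

In this sketch, the efficiency gain comes from committing several drafted tokens per step when both checks pass, instead of running a full lookahead roll-out with the large model for every token. Raising `min_reward` trades speed for stricter constraint satisfaction, since more steps fall back to the single-token path.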
