Guided Speculative Inference for Efficient Test-Time Alignment of LLMs
Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis
TL;DR
Guided Speculative Inference (GSI) is a provably grounded test-time decoding method that uses a small draft model to approximate the reward-tilted distribution of a larger base model. By combining reward tilting with likelihood-based adjustments, GSI provably approximates the optimal policy $π_{β,B}$ while keeping compute costs practical. Empirically, GSI improves reasoning performance across standard benchmarks, often approaching or surpassing soft best-of-$n$ with the base model and outperforming prior reward-guided methods, with favorable latency-accuracy trade-offs. This work offers a principled, scalable framework for aligning LLM outputs with a reward under real-time constraints, backed by theoretical guarantees and relevant to practical deployment.
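To see why a likelihood adjustment is needed when sampling from the draft model, consider the following standard importance-weighting identity. It is shown as an illustration, not as the paper's exact derivation. Wherever $π_S(y\mid x) > 0$ (assumed to hold wherever $π_B(y\mid x) > 0$), the tilted base policy can be rewritten as a tilt of the draft policy under a modified reward:

$$
π_{β,B}(y\mid x) \;\propto\; π_B(y\mid x)\,e^{β\,r(x,y)} \;=\; π_S(y\mid x)\,\exp\!\Big(β\Big[r(x,y) + \tfrac{1}{β}\log\tfrac{π_B(y\mid x)}{π_S(y\mid x)}\Big]\Big).
$$

Soft best-of-$n$ over draft samples scored with $\tilde r(x,y) = r(x,y) + \tfrac{1}{β}\log\tfrac{π_B(y\mid x)}{π_S(y\mid x)}$ therefore targets $π_{β,B}$ rather than the tilted draft policy $\propto π_S(y\mid x)\,e^{β\,r(x,y)}$.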
Abstract
We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $π_S(y\mid x)$. We provably approximate both the optimal tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ of soft best-of-$n$ under the base model $π_B$ and the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K), our method achieves higher accuracy than standard soft best-of-$n$ with $π_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $π_B$. The code is available at https://github.com/j-geuter/GSI.
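The combination of soft best-of-$n$, a reward model, and draft samples can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the GSI algorithm from the paper or its repository. It assumes a finite set of three responses with known probabilities under a hypothetical draft model $π_S$ and base model $π_B$, a hypothetical reward table, and helper names (`soft_best_of_n`, `tilted_reward`, `draft_guided_sample`, `target_policy`) invented for this example. It draws $n$ samples from $π_S$, scores each with the reward plus the likelihood-ratio correction $\tfrac{1}{β}\log(π_B/π_S)$, and keeps one via soft best-of-$n$, i.e. with probability proportional to $\exp(β\cdot\text{score})$. With enough candidates, the kept responses should be distributed approximately as $π_{β,B}$.

```python
import math
import random
from collections import Counter

# Illustrative sketch only; not the authors' GSI implementation.
# All names and numbers below are hypothetical.


def soft_best_of_n(candidates, scores, beta, rng):
    """Return one candidate, chosen with probability proportional to exp(beta * score)."""
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(beta * (s - m)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def tilted_reward(y, reward, base_probs, draft_probs, beta):
    """r(y) + (1/beta) * log(pi_B(y) / pi_S(y)): reward plus a likelihood-ratio correction."""
    return reward[y] + (math.log(base_probs[y]) - math.log(draft_probs[y])) / beta


def draft_guided_sample(draft_probs, base_probs, reward, beta, n, rng):
    """Draw n draft samples, score them with the tilted reward, keep one via soft best-of-n."""
    outcomes = list(draft_probs)
    candidates = rng.choices(outcomes, weights=[draft_probs[y] for y in outcomes], k=n)
    scores = [tilted_reward(y, reward, base_probs, draft_probs, beta) for y in candidates]
    return soft_best_of_n(candidates, scores, beta, rng)


def target_policy(base_probs, reward, beta):
    """Exact pi_{beta,B}(y), proportional to pi_B(y) * exp(beta * r(y)), over a finite outcome set."""
    unnorm = {y: p * math.exp(beta * reward[y]) for y, p in base_probs.items()}
    z = sum(unnorm.values())
    return {y: w / z for y, w in unnorm.items()}


# Hypothetical toy setup: three possible responses to a fixed prompt x.
draft_probs = {"a": 0.5, "b": 0.3, "c": 0.2}  # small model pi_S
base_probs = {"a": 0.2, "b": 0.5, "c": 0.3}   # large model pi_B
reward = {"a": 0.0, "b": 1.0, "c": 0.5}       # reward model r(x, y)
beta = 2.0
n = 64  # number of draft candidates per query

rng = random.Random(0)
trials = 10_000
counts = Counter(
    draft_guided_sample(draft_probs, base_probs, reward, beta, n, rng) for _ in range(trials)
)
empirical = {y: counts[y] / trials for y in base_probs}
target = target_policy(base_probs, reward, beta)

for y in base_probs:
    print(f"{y}: empirical={empirical[y]:.3f} target={target[y]:.3f}")
```

Running it prints the empirical selection frequencies next to the exact $π_{β,B}$ values (roughly 0.04, 0.78 and 0.17 for a, b and c). Dropping the likelihood-ratio term would instead target the tilted draft policy $\propto π_S\exp(β\,r)$, which in this toy setup puts noticeably more mass on response a.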
