
Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis

TL;DR

Guided Speculative Inference (GSI) is a provably grounded test-time decoding method that uses a small draft model to approximate the reward-tilted decoding distribution of a larger base model. By scoring draft samples with likelihood-adjusted (tilted) rewards, GSI approximates the optimal policy $π_{β,B}$ in distribution while remaining compute-efficient. Empirically, GSI improves reasoning performance across standard benchmarks, often approaching or surpassing soft best-of-$n$ with the base model and outperforming prior reward-guided methods, with favorable latency-accuracy trade-offs. This work offers a principled, scalable framework for aligning LLM outputs to rewards under real-time constraints, with theoretical guarantees and practical relevance for deployment.

Abstract

We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $π_S(y\mid x)$. We provably approximate both the optimal tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ of soft best-of-$n$ under the base model $π_B$ and the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K), our method achieves higher accuracy than standard soft best-of-$n$ with $π_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $π_B$. The code is available at https://github.com/j-geuter/GSI.
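To make the target of soft best-of-$n$ concrete, the sketch below computes the tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ exactly on a toy finite output space and compares it with the empirical selection frequencies of a plain soft best-of-$n$ sampler (draw $n$ candidates, pick one with probability $\propto \exp(β\,r)$). The function names, the three-output toy distribution, and the rewards are hypothetical illustrations, not the paper's code; a real decoder would sample reasoning steps from an LLM instead.

```python
import math
import random

def tilted_policy(base_probs, rewards, beta):
    """Exact pi_{beta,B}(y|x) ∝ pi_B(y|x) * exp(beta * r(x,y)) over a finite set of outputs."""
    weights = [p * math.exp(beta * r) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def soft_best_of_n(sample, reward, n, beta, rng):
    """Soft best-of-n: draw n candidates and return one with probability ∝ exp(beta * reward)."""
    candidates = [sample(rng) for _ in range(n)]
    scores = [beta * reward(y) for y in candidates]
    top = max(scores)  # shift by the max score so exp() cannot overflow
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical toy example: three possible outputs with base probabilities and rewards.
base_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
target = tilted_policy(base_probs, rewards, beta=1.0)

rng = random.Random(0)
draw_base = lambda r: r.choices(range(3), weights=base_probs, k=1)[0]
counts = [0, 0, 0]
for _ in range(5000):
    counts[soft_best_of_n(draw_base, lambda y: rewards[y], n=32, beta=1.0, rng=rng)] += 1
empirical = [c / 5000 for c in counts]
```

Here `target` is roughly `[0.18, 0.29, 0.53]`, and `empirical` lands close to it; the match tightens as `n` grows, since soft best-of-$n$ only recovers $π_{β,B}$ in the large-$n$ limit. Setting `beta=0` returns the base distribution unchanged.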

Paper Structure

This paper contains 23 sections, 4 theorems, 42 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $x\in \mathcal{X}$. Assume that the coverage assumption holds. Let $u\in\mathbb{R}$ be an acceptance threshold (cf. the GSI algorithm), and let $\epsilon>0$ be an arbitrary accuracy. Assume that … Then, …
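Why draft samples can target $\pi_{\beta,B}$ at all can be seen from a short importance-weighting identity. As a sketch, assume the tilted reward has the form $\tilde r(x,y) = r(x,y) + \frac{1}{\beta}\log\frac{\pi_B(y\mid x)}{\pi_S(y\mid x)}$ (an assumption here, chosen because it makes the identity below exact) and read coverage as $\pi_S(y\mid x)>0$ wherever $\pi_B(y\mid x)>0$, so that the ratio is well defined. Weighting draft samples by $\exp(\beta\,\tilde r)$ then recovers the base-model tilt:

$$
\pi_S(y\mid x)\,\exp\big(\beta\,\tilde r(x,y)\big) = \pi_S(y\mid x)\,\frac{\pi_B(y\mid x)}{\pi_S(y\mid x)}\,\exp\big(\beta\,r(x,y)\big) = \pi_B(y\mid x)\,\exp\big(\beta\,r(x,y)\big) \propto \pi_{\beta,B}(y\mid x).
$$

Soft best-of-$n$ over $n$ draft samples scored with $\tilde r$ is therefore a self-normalized importance sampler whose selection distribution tends to $\pi_{\beta,B}(\cdot\mid x)$ as $n\to\infty$; for finite $n$ it is only an approximation, which is the regime a finite-sample guarantee like Theorem 1 addresses.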

Figures (9)

  • Figure 1: Guided Speculative Inference workflow for one reasoning step. A sample $y^S_{i^*}$ generated by the draft model $\pi_S$ is selected via soft best-of-$n$ (S-BoN) with parameter $\beta$ using the tilted rewards $\tilde{r}_i$. If its reward lies above a threshold $u$, it is accepted; otherwise, it is rejected, which triggers resampling from the target model $\pi_B$ with soft best-of-$n$.
  • Figure 2: GSI outperforms RSD (Liao et al., 2025) and soft best-of-$n$ with the draft model, and approaches the performance of soft best-of-$n$ with the base model. We also compare against GSI without the rejection step. The plots show $95\%$ confidence intervals over three random seeds.
  • Figure 3: Reasoning traces generated by GSI on MATH500. Top: GSI correctly identifies that the second step generated by the draft model $\pi_S$ is wrong (crossed-out steps were rejected) and resamples from the base model $\pi_B$. Bottom: GSI sometimes rejects correct steps when $\pi_B$ would word them very differently from $\pi_S$.
  • Figure 4: Acceptance ratios for GSI and RSD across datasets, with $95\%$ confidence intervals. As $n$ increases, the acceptance ratio of GSI approaches $90\%$. The acceptance ratio of RSD is much higher and converges to almost $100\%$ as $n$ increases, which means RSD effectively collapses to soft best-of-$n$ with $\pi_S$.
  • Figure 5: Acceptance ratio of GSI for different values of $\beta$ on MATH500. A sharp phase transition between $\beta=8$ and $\beta=20$ can be observed.
  • ...and 4 more figures
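The workflow in Figure 1 can be sketched as a single step function: soft best-of-$n$ over draft samples scored with tilted rewards, an acceptance test against $u$, and a fallback to soft best-of-$n$ with the base model on rejection. This is a minimal sketch under stated assumptions, not the repository's implementation: `gsi_step`, `soft_select`, and the callback signatures are hypothetical; the tilted reward is assumed to be $\tilde r = r + \frac{1}{\beta}(\log \pi_B - \log \pi_S)$; and the acceptance test is assumed to compare the selected sample's tilted reward with $u$.

```python
import math
import random

def soft_select(candidates, scores, beta, rng):
    """Return (candidate, score), chosen with probability ∝ exp(beta * score)."""
    top = max(scores)  # assumes beta > 0; shifting by the max avoids overflow
    weights = [math.exp(beta * (s - top)) for s in scores]
    idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
    return candidates[idx], scores[idx]

def gsi_step(x, sample_draft, sample_base, logp_draft, logp_base, reward, n, beta, u, rng):
    """Hypothetical sketch of one GSI reasoning step (Figure 1).

    Assumptions: tilted reward r~ = r + (1/beta) * (log pi_B - log pi_S), and
    acceptance compares r~ of the selected draft step with the threshold u.
    Returns (step, accepted).
    """
    drafts = [sample_draft(x, rng) for _ in range(n)]
    tilted = [reward(x, y) + (logp_base(x, y) - logp_draft(x, y)) / beta for y in drafts]
    step, score = soft_select(drafts, tilted, beta, rng)
    if score >= u:
        return step, True  # accepted: the base model only scored drafts, never generated
    # Rejected: resample n steps from the base model and apply soft best-of-n on r.
    base_steps = [sample_base(x, rng) for _ in range(n)]
    step, _ = soft_select(base_steps, [reward(x, y) for y in base_steps], beta, rng)
    return step, False
```

The fraction of calls returning `accepted=True` is the acceptance ratio plotted in Figure 4. With `u = -math.inf` every draft step is accepted, which corresponds to a variant without a rejection step; raising `u` sends more steps to the slower base model in exchange for higher-reward steps.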

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2 (informal)
  • Theorem 1
  • Proof
  • Theorem 2
  • Proof
  • Remark