$\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts

Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer, Ion Stoica, Kannan Ramchandran, Ahmad Beirami, Ziteng Sun

TL;DR

This work proposes a latency-aware test-time scaling method inspired by speculative decoding, and introduces new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism.

Abstract

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting user experience. Current test-time scaling methods primarily optimize for accuracy based on total compute resources (FLOPs), often overlooking latency constraints. To address this gap, we propose $\texttt{SPECS}$, a latency-aware test-time scaling method inspired by speculative decoding. $\texttt{SPECS}$ uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism. Empirical results on the MATH500, AMC23 and OlympiadBench datasets show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to approximately 19.1%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective with increasing beam width.
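The draft-then-score loop described in the abstract can be illustrated with a minimal sketch of a single block-level step. Everything named here is a hypothetical stand-in rather than the authors' implementation: `specs_like_step`, the single `score` callable (the abstract does not specify how the target-model and reward-model signals are combined), the softmax-style selection with temperature `beta`, and the threshold rule `tau` are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Tuple

Proposer = Callable[[str, random.Random], str]   # (prefix, rng) -> candidate block
Scorer = Callable[[str, str], float]             # (prefix, block) -> score, assumed in [0, 1]


def soft_select(scores: List[float], beta: float, rng: random.Random) -> int:
    """Sample an index with probability proportional to exp(beta * score)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(beta * (s - top)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


def specs_like_step(prefix: str, draft: Proposer, target: Proposer, score: Scorer,
                    n: int, beta: float, tau: float,
                    rng: random.Random) -> Tuple[str, str]:
    """One hypothetical step: draft proposes, scores select, low scores defer."""
    # 1) The small draft model proposes n candidate blocks.
    candidates = [draft(prefix, rng) for _ in range(n)]
    scores = [score(prefix, c) for c in candidates]
    # 2) Soft selection: higher-scoring candidates are more likely to be kept.
    i = soft_select(scores, beta, rng)
    if scores[i] >= tau:
        return candidates[i], "draft"
    # 3) Deferral (assumed rule): the chosen score is below tau, so regenerate
    #    the block with the larger target model and select again.
    candidates = [target(prefix, rng) for _ in range(n)]
    scores = [score(prefix, c) for c in candidates]
    return candidates[soft_select(scores, beta, rng)], "target"


# Toy stand-ins for the models; real use would call LLMs and a reward model.
def toy_draft(prefix: str, rng: random.Random) -> str:
    return f"d{rng.randint(0, 9)}"


def toy_target(prefix: str, rng: random.Random) -> str:
    return f"t{rng.randint(0, 9)}"


def toy_score(prefix: str, block: str) -> float:
    return int(block[1:]) / 10.0  # always in [0.0, 0.9]


block, source = specs_like_step("", toy_draft, toy_target, toy_score,
                                n=4, beta=5.0, tau=0.5, rng=random.Random(0))
print(source, block)
```

In this toy setup, raising `tau` makes deferral to the target model more frequent, which trades latency for the larger model's quality; setting `tau` above every achievable score means every block comes from the target model.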

Paper Structure

This paper contains 46 sections, 14 theorems, 92 equations, 7 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Assume $n \ge 3$ and suppose SPECS is implemented with the idealized PRM as defined in Definition 2 with beam width $n$. Then, for reasoning problems over $H$ blocks, in the finite block-length setting, the policy $\pi_{\texttt{SPECS}}$ returned by SPECS satisfies, Here, we assume that the PRM reward range is $[0,R]$. The block-level coverage coefficient $C_{\mathtt{block}}$ is, where $\pi

Figures (7)

  • Figure 1: Visualization of beam search vs. SPECS. In beam search, the trajectories are generated by the target model ($p$) and scored using a PRM ($r$). In contrast, in SPECS the beams are dynamically switched to generation from the draft model and scored by a combination of the target model and the PRM, resulting in a better latency-performance tradeoff. Draft proposal and selection are further controlled by the SubSample subroutine (see the algorithm section for details).
  • Figure 2: (a) Latency of generation from the target model (Qwen2.5-7B-Instruct) vs. generation from the draft model (Qwen2.5-1.5B-Instruct) with scoring: we observe that the latency savings from using the draft model to generate candidate blocks outweigh the overhead of scoring by the target model and PRM. (b) We generate the first $8$ steps of reasoning from the target model, and complete the remaining steps using either the draft model or the target model. The initial $8$-step partial reasoning traces generated by the target model are bucketed into high reward (PRM score at least $0.5$) and low reward (PRM score at most $0.5$). Using the draft model to complete high-reward traces solves a very similar proportion of problems to completing them with the target model. The performance is dismal ($\approx 0$) when the draft model is used to complete low-reward traces.
  • Figure 3: Accuracy vs per-query average latency curves for SPECS and beam search with draft model, target model and RSD. The error bars show the standard deviation computed over 3 independent runs. The draft model is Qwen2.5-1.5B-Instruct and the target model is Qwen2.5-7B-Instruct.
  • Figure 4: Changing the value of the threshold $\tau$ gives us an accuracy-latency Pareto curve for SPECS. As a sanity check, when $\tau \to 1$, SPECS collapses to beam search with the large model, and the corresponding points coincide.
  • Figure 5: Latency breakdown of SPECS
  • ...and 2 more figures

Theorems & Definitions (31)

  • Definition 1: Optimal KL-regularized value function
  • Definition 2: Idealized Process Reward Model / optimal KL-regularized advantage function
  • Theorem 1
  • Lemma 1: Equivalence of KL minimization and KL-regularized reward maximization
  • proof
  • Definition 3: Infinite block-length regime
  • Theorem 2: Guarantee for SPECS in the infinite block-length regime
  • Remark 1
  • Definition 4: Sequential Monte Carlo (SMC)
  • Definition 5: Exponentiated score
  • ...and 21 more
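
For readers unfamiliar with the objects named in Definition 1 and Lemma 1, the standard form of a KL-regularized objective and its closed-form solution is sketched below. The notation is an assumption chosen for illustration ($p$ for the base model, $r$ for the reward, $\beta > 0$ for the regularization strength, $Z$ for the normalizer); the paper's own parametrization may differ.

```latex
% KL-regularized reward maximization over policies \pi (assumed notation):
\pi^\star \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi}\big[r(y)\big] \;-\; \frac{1}{\beta}\,\mathrm{KL}\big(\pi \,\|\, p\big),
\qquad
\pi^\star(y) \;=\; \frac{p(y)\,\exp\!\big(\beta\, r(y)\big)}{Z},
\quad
Z \;=\; \sum_{y} p(y)\,\exp\!\big(\beta\, r(y)\big).

% Expanding \mathrm{KL}(\pi \,\|\, \pi^\star) = \mathrm{KL}(\pi \,\|\, p) - \beta\,\mathbb{E}_{\pi}[r] + \log Z gives
\mathbb{E}_{y \sim \pi}\big[r(y)\big] \;-\; \frac{1}{\beta}\,\mathrm{KL}\big(\pi \,\|\, p\big)
  \;=\; -\frac{1}{\beta}\,\mathrm{KL}\big(\pi \,\|\, \pi^\star\big) \;+\; \frac{1}{\beta}\,\log Z .
```

Since $\log Z$ does not depend on $\pi$, maximizing the regularized reward is the same as minimizing $\mathrm{KL}(\pi \,\|\, \pi^\star)$, which is the kind of equivalence the title of Lemma 1 refers to.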