Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

TL;DR

This work tackles the efficiency-accuracy trade-off in test-time scaling for reasoning-heavy language models. It introduces STAND, a model-free speculative decoding approach that uses a memory-efficient logit-based N-gram module, stochastic drafting, and data-driven draft-tree optimization to accelerate token generation without training. STAND achieves substantial latency reductions (notably $60\%$–$65\%$) while maintaining or improving throughput across multi-trajectory, single-trajectory, batch, and tree-search inference on AIME-2024, GPQA-Diamond, and LiveCodeBench, outperforming state-of-the-art speculative methods. By exploiting cross-trajectory redundancy and probabilistic drafting, STAND provides a plug-and-play acceleration framework applicable to existing LRMs with broad practical impact on reasoning tasks.

Abstract

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
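The Gumbel-Top-K sampling mentioned in the abstract can be illustrated with a minimal sketch. It shows the generic trick only (perturb each logit with independent Gumbel noise and keep the k largest, which draws k distinct tokens without replacement), not the authors' optimized implementation. The function name `gumbel_top_k` and its parameters are hypothetical and chosen for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def gumbel_top_k(
    logits: Sequence[float], k: int, rng: Optional[random.Random] = None
) -> List[int]:
    """Sample k distinct indices without replacement via the Gumbel-Top-K trick.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    rng = rng or random.Random()
    perturbed = []
    for index, logit in enumerate(logits):
        # random() returns a value in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    # The k largest perturbed logits give the k sampled indices.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Example: draw 3 distinct candidate tokens from a 5-token vocabulary.
candidates = gumbel_top_k([2.0, 1.0, 0.5, 0.0, -1.0], k=3, rng=random.Random(0))
```

In a drafting setting, the k returned indices would serve as sibling candidates at one level of a draft tree. How STAND arranges and verifies them is described in the paper itself and is not reproduced here.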

Paper Structure

This paper contains 33 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Scaling curve with speculative decoding. We report how task performance improves with total decoding time. With the total time of simple autoregressive decoding normalized to 1, we also report the scaling curves for different model-free SD methods. We report the reward-weighted majority voting accuracy for AIME-2024 and GPQA-Diamond, and pass@k for LiveCodeBench, where k is the total number of sequences generated at a given point. All measurements are made on a single A100 GPU with DeepSeek-R1-Distill-Qwen-7B.
  • Figure 2: N-gram overlaps across reasoning trajectories. We report the N-gram overlaps across different numbers of reasoning trajectories, generated by DeepSeek-R1-Distill-Qwen-7B on AIME-2024. The overlap is defined as the percentage of N-grams that appear twice or more in the k reasoning trajectories, counting duplicates multiple times. We observe high N-gram overlaps across reasoning paths, presenting an opportunity for faster drafting.
  • Figure 3: Deterministic vs. stochastic drafting. We report the acceptance probability of a token, given a draft tree with depth 1 and width 3. Measurements are made using the DeepSeek-R1-Distill-Qwen-7B model, and the draft tree is constructed using the N-gram module in STAND.
  • Figure 4: Overview of STAND. (Left) The N-gram module stores logits instead of discrete tokens, enabling stochastic drafting. When the language model generates "I am Bob", we store the probability distribution over the next token rather than just the sampled token. (Right) Data-driven draft tree optimization: We start with an initial large draft tree, measure node-wise acceptance rates during speculative decoding on real data, and prune to retain the most successful paths.
  • Figure 5: Structure of the Optimized Tree. We report the number of nodes at each tree depth for draft trees optimized for Token Recycle and for STAND. Both trees are optimized on the AIME-2024 dataset with DeepSeek-R1-Distill-Qwen-7B.
  • ...and 1 more figure
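The overlap metric defined in the Figure 2 caption can be sketched as follows. This is one reading of the caption, under two assumptions that the caption does not settle: trajectories are given as token-ID sequences, and an N-gram repeated within a single trajectory counts the same as one repeated across trajectories. The function name `ngram_overlap` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def ngram_overlap(trajectories: Sequence[Sequence[int]], n: int) -> float:
    """Percentage of N-gram occurrences whose N-gram appears at least twice.

    Duplicates are counted multiple times: an N-gram seen 3 times
    contributes 3 to both the numerator and the denominator.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    counts: Counter = Counter()
    for tokens in trajectories:
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1

    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c >= 2)
    return 100.0 * repeated / total


# Two toy trajectories that share the prefix 1, 2, 3.
toy = [[1, 2, 3, 4], [1, 2, 3, 5]]
print(ngram_overlap(toy, n=3))  # 50.0: two of the four 3-gram occurrences are (1, 2, 3)
```

A high value of this metric means that many N-grams in a new trajectory have already appeared in earlier ones, which is the redundancy a model-free drafter can exploit.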