SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali

TL;DR

The paper tackles the latency bottleneck of inference-time reasoning in Large Reasoning Models caused by long chains of thought. It introduces SpecReason, which speculatively offloads easier intermediate reasoning steps to a lightweight model while the base model verifies and corrects as needed. This yields 1.4–3.0× speedups with 0.4–9.0% accuracy gains across reasoning benchmarks, and further latency reductions when combined with speculative decoding. The work provides tunable accuracy-latency tradeoffs via acceptance thresholds and early-forcing options, and is open-sourced for practical adoption.

Abstract

Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of reasoning benchmarks, SpecReason achieves $1.4$–$3.0\times$ speedup over vanilla LRM inference while improving accuracy by $0.4$–$9.0\%$. Compared to speculative decoding without SpecReason, their combination yields an additional $8.8$–$58.0\%$ latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.
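The step-level loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `spec_reason`, the callables `small_step`, `base_step`, `base_score`, and `is_final`, and the default threshold of 7 are all hypothetical names and values chosen for the example. The only detail taken from the source is that the base model's assessment is compared against an acceptance threshold on a scale that tops out at 9.

```python
from typing import Callable, List, Tuple

Steps = List[str]


def spec_reason(
    question: str,
    small_step: Callable[[str, Steps], str],       # hypothetical: lightweight model proposes a step
    base_step: Callable[[str, Steps], str],        # hypothetical: base model generates a step
    base_score: Callable[[str, Steps, str], int],  # hypothetical: base model rates a step (max 9)
    is_final: Callable[[str], bool],               # hypothetical: detects the final-answer step
    threshold: int = 7,                            # assumed acceptance threshold
    max_steps: int = 32,
) -> Tuple[Steps, int]:
    """Return the reasoning chain and the number of accepted speculated steps."""
    steps: Steps = []
    accepted = 0
    for _ in range(max_steps):
        # Speculate the next reasoning step with the cheap model.
        candidate = small_step(question, steps)
        # The base model only assesses the step, which is cheaper than writing it.
        if base_score(question, steps, candidate) >= threshold:
            accepted += 1
        else:
            # Rejected: fall back to the base model for this step.
            candidate = base_step(question, steps)
        steps.append(candidate)
        if is_final(candidate):
            break
    return steps, accepted


def demo() -> Tuple[Steps, int]:
    # Stub "models" that replay fixed steps; the small one makes one arithmetic slip.
    small_plan = ["2 + 3 = 5", "5 * 4 = 21", "Answer: 20"]
    base_plan = ["2 + 3 = 5", "5 * 4 = 20", "Answer: 20"]
    return spec_reason(
        "What is (2 + 3) * 4?",
        small_step=lambda q, s: small_plan[len(s)],
        base_step=lambda q, s: base_plan[len(s)],
        base_score=lambda q, s, c: 9 if c == base_plan[len(s)] else 2,
        is_final=lambda c: c.startswith("Answer"),
    )
```

With these stubs, `demo()` accepts the first and last speculated steps and replaces the faulty middle step with the base model's version. Raising `threshold` makes the loop defer to the base model more often, which is the accuracy-latency knob the paper describes. The stub scorer here compares strings exactly only to keep the example deterministic; the method itself judges a step by its semantic usefulness rather than token-level equality.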

Paper Structure

This paper contains 15 sections, 9 figures.

Figures (9)

  • Figure 1: SpecReason leverages a smaller reasoning model to speculate individual reasoning steps, deferring to the base model only for assessment (and optionally as a fallback), enabling faster yet accurate reasoning. For illustration, we show a math question as an example; our evaluation includes more general reasoning workloads.
  • Figure 2: The spectrum of approximations of one example reasoning step (equation 1 in the toy-example figure). SpecReason can control the exactness of reasoning approximations by adjusting its acceptance threshold to navigate the accuracy-latency tradeoff space (see the section on the accuracy-latency tradeoff).
  • Figure 3: Comparison of the accuracy and latency of different schemes on different model combinations. SpecReason significantly reduces latency while improving accuracy over vanilla inference. When combined with speculative decoding, SpecReason outperforms speculative decoding in both latency and accuracy on all datasets and model combinations.
  • Figure 4: [QwQ-32B + Zyphra-1.5B] Intuition behind SpecReason's accuracy improvement. See the appendix for the full set of results.
  • Figure 5: [QwQ-32B + R1-1.5B] SpecReason allows trading off latency for accuracy via adjusting the acceptance threshold (from left to right, the thresholds are: 3, 5, 7, and 9 out of 9).
  • ...and 4 more figures