Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han

TL;DR

Speculative Thinking is a training-free framework in which a larger reasoning model supervises a smaller one during inference, taking over difficult reasoning steps at structurally meaningful points. It uses the reflective cues that follow paragraph breaks to trigger targeted intervention, improving reasoning accuracy and shortening outputs without retraining. Results on MATH500, AIME, GPQA, and AMC23 show that small models can approach or exceed large-model performance with lightweight mentor guidance, and that even non-reasoning models benefit when supervised. The approach offers a practical, scalable way to improve reasoning in deployment while maintaining efficiency.

Abstract

Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, an improvement of 6.2 percentage points. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, an absolute improvement of 7.8 percentage points.
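
The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the reported figures; the helper names `absolute_gain` and `relative_change` are illustrative and do not come from the paper.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference in percentage points (or raw units) between two values."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, expressed in percent."""
    return (after - before) / before * 100.0


# Figures as reported in the abstract.
math500_points = absolute_gain(83.2, 89.4)       # 1.5B model with 32B guidance
math500_relative = relative_change(83.2, 89.4)
length_relative = relative_change(5439, 4583)    # average output tokens
instruct_points = absolute_gain(74.0, 81.8)      # Qwen-2.5-7B-Instruct
instruct_relative = relative_change(74.0, 81.8)

print(round(math500_points, 1), round(math500_relative, 1))
print(round(length_relative, 1))
print(round(instruct_points, 1), round(instruct_relative, 1))
```

Under these definitions, the 6.2 and 7.8 figures are absolute gains in percentage points; the corresponding relative gains are roughly 7.5% and 10.5%. The 15.7% length reduction is a relative change.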

Paper Structure

This paper contains 19 sections, 2 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Speculative Thinking significantly improves the 1.5B model’s reasoning accuracy while simultaneously reducing its average output length. This figure compares the accuracy and average output length of models on four mathematical and reasoning datasets: AIME 2020–2024, MATH500, GPQA, and AMC23. "1.5B" denotes the DeepSeek-Distilled Qwen 2.5-1.5B model, "32B" refers to the DeepSeek-Distilled Qwen 2.5-32B model, and "1.5B+32B" represents our proposed Speculative Thinking method, where the 32B model supervises reflective reasoning steps of the 1.5B model during inference.
  • Figure 2: Comparison of outputs between a reasoning model and a non-reasoning model. Reasoning models often generate negative sentences, typically containing tokens such as “wait”, immediately following the delimiter "\n\n". These sentences serve as reflective prompts, helping the model to backtrack, reassess, and verify prior reasoning steps.
  • Figure 3: Accuracy and output statistics of three models on the AIME 2022–2024 dataset. Reported metrics include overall accuracy (upper left), average output length (upper right), average output length for correct and incorrect answers (lower left), and the number of reflective sentences, such as those containing “wait” or “alternatively”, in correct and incorrect responses (lower right). “#=67” indicates that the 1.5B model gave 67 incorrect responses. The average output length of small models is significantly higher than that of large models, primarily because their incorrect responses are excessively long. This stems from inefficient and redundant self-reflection in small models, which often leads to failed reasoning attempts and prevents them from reaching a correct answer before hitting their maximum output length.
  • Figure 4: Overview of speculative thinking. A small model generates most of the output but selectively delegates challenging segments to a stronger model. These segments are marked by structural cues such as paragraph breaks ("\n\n") followed by reflective phrases like “wait,” “alternatively,” or “hold on.” Small models often produce verbose or incoherent outputs at these points, while larger models handle them concisely. The proposed speculative thinking preserves efficiency while leveraging the large model’s strength when most needed.
  • Figure 5: A comparison between the prefix and decode stages reveals that the time (in seconds) required to process multiple tokens during the prefix phase is nearly equivalent to the time taken to decode a single token.
  • ...and 6 more figures
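
The delegation idea described in the abstract and in the Figure 4 caption can be sketched as a simple control loop. This is a simplified illustration under stated assumptions, not the authors' implementation: `small_step` and `large_step` are hypothetical callables that each return the next paragraph-level segment for a given context, the cue list is taken from the Figure 4 caption, and the prefix match, the `stop_marker`, and the `max_segments` cap are placeholders chosen for the example.

```python
from typing import Callable, List

# Reflective phrases named in the Figure 4 caption.
REFLECTIVE_CUES = ("wait", "alternatively", "hold on")
# Structural delimiter after which reflective sentences tend to appear.
DELIMITER = "\n\n"


def starts_with_reflective_cue(segment: str) -> bool:
    """Crude check: does the segment open with one of the reflective phrases?"""
    return segment.lstrip().lower().startswith(REFLECTIVE_CUES)


def speculative_thinking(
    prompt: str,
    small_step: Callable[[str], str],   # hypothetical: small model, one segment
    large_step: Callable[[str], str],   # hypothetical: large model, one segment
    max_segments: int = 32,             # assumed safety cap
    stop_marker: str = "[DONE]",        # assumed end-of-answer marker
) -> List[str]:
    """Let the small model draft each segment; hand reflective ones to the large model."""
    context = prompt
    segments: List[str] = []
    for _ in range(max_segments):
        segment = small_step(context)
        # A reflective opening after a paragraph break triggers delegation:
        # the small model's draft is discarded and the large model writes
        # that segment instead.
        if starts_with_reflective_cue(segment):
            segment = large_step(context)
        segments.append(segment)
        context += segment + DELIMITER
        if stop_marker in segment:
            break
    return segments
```

In this sketch the large model is called only for segments that the small model opens reflectively, so most of the text still comes from the small model. A real system would work with token streams and model-specific stopping criteria rather than whole-segment callables.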