Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

Wang Yang; Xiang Yue; Vipin Chaudhary; Xiaotian Han

TL;DR

Speculative Thinking is a training-free framework in which a larger reasoning model guides a smaller one at inference time: the small model generates most of the output and hands difficult reflective steps to the large model at structurally meaningful points. It treats reflective cues that appear after paragraph breaks ("\n\n") as triggers for targeted intervention, which raises reasoning accuracy and shortens outputs without any retraining. Results on MATH500, AIME, GPQA, and AMC23 show that small models with lightweight mentor guidance can approach or exceed large-model performance, and that non-reasoning models also benefit when supervised. The approach offers a practical, scalable way to improve reasoning in deployment while keeping inference efficient.

Abstract

Recent advances rely on post-training to improve model reasoning, which typically requires costly training pipelines and still yields inefficient, overly long outputs. We introduce Speculative Thinking, a training-free framework in which a large reasoning model guides a smaller one during inference at the reasoning level, as distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the accuracy of small reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, an absolute improvement of 6.2 percentage points. At the same time, the average output length falls from 5439 tokens to 4583 tokens, a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework raises its accuracy on the same benchmark from 74.0% to 81.8%, an absolute improvement of 7.8 percentage points.
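The delegation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paragraph-level loop, the prefix-matching rule, the cue list (only "wait" is named in the abstract), and the names `speculative_thinking`, `is_reflective`, `small`, and `large` are all assumptions made for illustration. Each model is represented as a plain callable that takes the text so far and returns the next paragraph, or `None` when it has finished.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Structural delimiter named in the abstract.
DELIMITER = "\n\n"

# Hypothetical cue list: the abstract gives "wait" only as an example.
REFLECTIVE_CUES: Tuple[str, ...] = ("wait",)

# A model is any callable mapping the text so far to the next paragraph,
# or None when generation is finished (an assumption of this sketch).
ParagraphFn = Callable[[str], Optional[str]]


def is_reflective(paragraph: str, cues: Sequence[str] = REFLECTIVE_CUES) -> bool:
    """Return True if the paragraph opens with a reflective cue (case-insensitive prefix match)."""
    return paragraph.lstrip().lower().startswith(tuple(cues))


def speculative_thinking(
    prompt: str,
    small: ParagraphFn,
    large: ParagraphFn,
    cues: Sequence[str] = REFLECTIVE_CUES,
    max_paragraphs: int = 32,
) -> Tuple[List[str], List[bool]]:
    """Generate paragraph by paragraph, delegating reflective steps to `large`.

    Returns the accepted paragraphs and, for each one, whether the large
    model wrote it.
    """
    paragraphs: List[str] = []
    delegated: List[bool] = []
    for _ in range(max_paragraphs):
        # Context is the prompt plus every accepted paragraph and its delimiter.
        context = prompt + "".join(p + DELIMITER for p in paragraphs)
        draft = small(context)
        if draft is None:
            break  # the small model has finished
        if is_reflective(draft, cues):
            # The small model is about to reflect: let the large model
            # write this step instead, and fall back to the draft if it declines.
            replacement = large(context)
            if replacement is not None:
                paragraphs.append(replacement)
                delegated.append(True)
                continue
        paragraphs.append(draft)
        delegated.append(False)
    return paragraphs, delegated
```

In this sketch the large model is called only when a draft paragraph opens with a cue, so most of the output still comes from the small model. In practice the two callables would wrap real model calls that stop generation at the "\n\n" delimiter; how the actual system detects cues and how much text the large model contributes per intervention is not specified in the abstract.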
