Controlling Thinking Speed in Reasoning Models

Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye

TL;DR

This work tackles the challenge of balancing speed and accuracy in Large Reasoning Models by introducing dynamic thinking speed control. It first reveals an intrinsic fast/slow thinking switch in LRMs and then derives a PCA-based steering vector from representation differences to modulate reasoning during inference. The paper then pairs this representation-editing approach with an adaptive, difficulty-aware mechanism that uses logit-based signals to adjust thinking speed in real time, achieving improved accuracy and efficiency across multiple models and benchmarks. Collectively, the methods provide a plug-in framework for faster, simpler reasoning on easy tasks and more thorough, correct analyses for complex problems, with broad implications for scalable AI reasoning systems.

Abstract

Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
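
To make the second question (when to adjust thinking speed) concrete, here is a minimal sketch of a difficulty-aware switch. The summary above only says that a logit-based signal is used. The choice of next-token entropy as that signal, the function names `token_entropy` and `choose_alpha`, the threshold, and the two intensity values are all assumptions made for illustration, not the paper's actual estimator. The sign convention (positive intensity means faster thinking) is also assumed.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_alpha(logits, threshold=1.0, fast_alpha=4.0, slow_alpha=-4.0):
    """Pick a steering intensity from a hypothetical difficulty proxy.

    Assumption: high entropy (an uncertain next token) marks a hard step,
    so the model is steered toward slow thinking; low entropy marks an
    easy step, so it is steered toward fast thinking.
    """
    if token_entropy(logits) > threshold:
        return slow_alpha
    return fast_alpha


# A sharply peaked distribution is treated as easy, a flat one as hard.
easy_step = choose_alpha([10.0, 0.0, 0.0, 0.0])
hard_step = choose_alpha([0.0, 0.0, 0.0, 0.0])
```

In a real decoding loop the chosen intensity would be re-evaluated as generation proceeds, so that easy segments are compressed and hard segments receive more deliberation.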

Paper Structure

This paper contains 62 sections, 89 equations, 15 figures, 11 tables, 1 algorithm.

Figures (15)

  • Figure 1: The framework of our thinking speed control method. We utilize the System 1 and System 2 thinking modes of LRMs to sample fast- and slow-thought pairs. We then extract the steering vector that governs the transition between these thinking modes from LRMs, which enables us to steer the LRMs toward either fast-thinking for efficiency or slow-thinking for better accuracy.
  • Figure 2: Leading words statistics on MATH-500 from responses of DeepSeek-R1-Distill-Qwen-7B.
  • Figure 3: Comparison of LRMs' performance with and without the leading-word restriction. Initiating the thought with "To" sharply reduces token usage for LRMs (a 4.4x reduction), and performance remains comparable to regular responses considering how much shorter the outputs are.
  • Figure 4: Illustration of our representation engineering process. We extract the directional vector corresponding to the transition from slow-thinking to fast-thinking in the representation spaces of LRMs by contrasting fast and slow thoughts. During inference, we strategically inject this vector to manipulate the model's thinking behavior.
  • Figure 5: Scaling effects of thinking speed control. We show the trade-off between response length (x-axis, average token count) and reasoning accuracy (y-axis, Pass@1). Key annotations: (1) "$\star$": baseline model performance. (2) "$\bullet$ $\sim$ $\bullet$": representation control with different steering intensities $\alpha$ (positive to negative). (3) "$\bullet$": budget-forced early exiting at varying positions. (4) "$\bullet$ $\sim$ $\bullet$": thought extrapolation by appending "Wait" 1x/2x/3x times. Our control method consistently outperforms the baselines. A tabular version of these results is presented in \ref{tab:scaling_law}.
  • ...and 10 more figures
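
The extract-then-inject process described in the Figure 4 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation. The names `steering_direction` and `steer` are hypothetical, and the hidden states are plain lists of floats standing in for one layer's activations. The principal direction is taken from the uncentered matrix of fast-minus-slow differences via power iteration, which is one PCA-style choice among several.

```python
import math
import random


def steering_direction(fast_states, slow_states, iters=200, seed=0):
    """Estimate a unit slow-to-fast direction from paired hidden states.

    fast_states, slow_states: equal-length lists of equal-length float
    lists (hypothetical per-prompt hidden states from one layer).
    """
    diffs = [[f - s for f, s in zip(fr, sr)]
             for fr, sr in zip(fast_states, slow_states)]
    n, d = len(diffs), len(diffs[0])

    # Power iteration on D^T D converges to the top right-singular vector
    # of the (uncentered) difference matrix D, i.e. its dominant direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(row[j] * v[j] for j in range(d)) for row in diffs]
        w = [sum(diffs[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("difference matrix has no dominant direction")
        v = [x / norm for x in w]

    # The sign of a principal direction is arbitrary; orient it slow -> fast.
    mean_diff = [sum(row[j] for row in diffs) / n for j in range(d)]
    if sum(a * b for a, b in zip(v, mean_diff)) < 0:
        v = [-x for x in v]
    return v


def steer(hidden, direction, alpha):
    """Shift one hidden state by alpha along the steering direction."""
    return [h + alpha * u for h, u in zip(hidden, direction)]
```

With the sign convention assumed in this sketch, a positive `alpha` pushes a hidden state toward the fast-thinking side and a negative `alpha` toward the slow-thinking side. Sweeping `alpha` is what would trace out a length-versus-accuracy curve of the kind described for Figure 5.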