Thinker: Learning to Think Fast and Slow

Stephen Chung, Wenyu Du, Jie Fu

TL;DR

This work addresses inefficiencies in reinforcement-learning-driven reasoning for large language models by introducing Thinker, a four-stage task that decomposes QA into Fast Thinking (intuition under a tight token budget), Verification, Slow Thinking (deliberate refinement), and Summarization (concise integration). Each stage receives its own reward signal, enabling explicit credit assignment and targeted training of distinct cognitive skills; the approach uses PPO and fixed token budgets, with training performed on a large math QA dataset. Empirical results show consistent accuracy gains on multiple benchmarks across two model families, with notable improvements in both final accuracy and inference efficiency, including a fast mode that uses fewer than 1000 tokens. The authors also demonstrate reduced reflection tendencies and provide open-source models and code to facilitate adoption and further research in structured reasoning for LLMs.

Abstract

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
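
The four stages described in the abstract can be read as a small control loop with one reward per stage. The sketch below is an illustration under stated assumptions, not the authors' implementation: the prompts, the `ANSWER:` tag, the token budgets other than the 1000-token Fast Thinking limit, the rule that a "yes" verdict ends the episode, and all reward definitions are hypothetical placeholders (the paper's actual reward functions and transition conditions are given in its main text). The names `run_thinker`, `make_scripted_model`, and `extract_answer` are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical per-stage token budgets. Only the Fast Thinking figure
# ("fewer than 1000 tokens") comes from the abstract; the rest are assumed.
BUDGETS = {"fast": 1000, "verify": 2000, "slow": 4000, "summary": 1000}

# A generator takes (prompt, max_tokens) and returns the response text.
Generate = Callable[[str, int], str]


@dataclass
class Episode:
    responses: Dict[str, str] = field(default_factory=dict)
    rewards: Dict[str, float] = field(default_factory=dict)
    final_answer: Optional[str] = None


def extract_answer(text: str) -> str:
    """Toy parser: return whatever follows the last 'ANSWER:' tag."""
    return text.rsplit("ANSWER:", 1)[-1].strip()


def run_thinker(question: str, gold: str, generate: Generate) -> Episode:
    ep = Episode()

    # Stage 1: Fast Thinking, answered under the tight budget.
    fast = generate(f"Answer concisely.\n{question}", BUDGETS["fast"])
    fast_answer = extract_answer(fast)
    fast_ok = fast_answer == gold
    ep.responses["fast"] = fast
    ep.rewards["fast"] = float(fast_ok)  # placeholder: exact-match reward

    # Stage 2: Verification, the model judges its own fast answer.
    verdict = generate(f"Is this correct?\n{question}\n{fast}", BUDGETS["verify"])
    says_ok = extract_answer(verdict).lower() == "yes"
    ep.responses["verify"] = verdict
    ep.rewards["verify"] = float(says_ok == fast_ok)  # placeholder: verdict matches truth

    if says_ok:  # assumed transition: accept the fast answer and stop
        ep.final_answer = fast_answer
        return ep

    # Stage 3: Slow Thinking, refine the answer with a larger budget.
    slow = generate(f"Reconsider carefully.\n{question}\n{fast}", BUDGETS["slow"])
    slow_answer = extract_answer(slow)
    ep.responses["slow"] = slow
    ep.rewards["slow"] = float(slow_answer == gold)

    # Stage 4: Summarization, distill the refinement into concise steps.
    summary = generate(f"Summarize the solution.\n{slow}", BUDGETS["summary"])
    ep.responses["summary"] = summary
    # Placeholder reward: the summary must keep the Slow Thinking answer.
    ep.rewards["summary"] = float(extract_answer(summary) == slow_answer)
    ep.final_answer = slow_answer
    return ep


def make_scripted_model(script: List[str]) -> Generate:
    """Fake generator that replays canned responses and ignores the budget."""
    replies = iter(script)

    def generate(prompt: str, max_tokens: int) -> str:
        return next(replies)

    return generate


# Usage: a wrong fast answer is rejected, refined, then summarized.
model = make_scripted_model(
    ["ANSWER: 5", "ANSWER: no", "2 + 2 = 4. ANSWER: 4", "Add the numbers. ANSWER: 4"]
)
episode = run_thinker("What is 2 + 2?", "4", model)
print(episode.final_answer, episode.rewards)
```

Keeping a separate entry in `rewards` for each stage mirrors the idea that each skill (intuition, evaluation, refinement, integration) gets its own training signal instead of sharing a single end-of-episode reward.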

Paper Structure

This paper contains 23 sections, 12 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Conceptual model of the interaction between Fast Thinking and Slow Thinking modes in the Thinker task, based on Dual Process Theory.
  • Figure 2: The four-step Thinker task. Each stage involves a user prompt, model response, and specific rewards and transition conditions designed to train distinct agent capabilities (intuition, evaluation, refinement, and integration). Reward function details are in the main text.
  • Figure 3: Accuracy on training set. For the Qwen2.5-1.5B model (a), the shaded region represents the standard deviation across three independent seeds.
  • Figure 4: Average response length on training set. For the Qwen2.5-1.5B model (a), the shaded region represents the standard deviation across three independent seeds.
  • Figure 5: Evaluation performance across seven common benchmarks. For the Qwen2.5-1.5B model (a), the shaded region represents the standard deviation across three independent seeds.
  • ...and 6 more figures