Thinker: Learning to Think Fast and Slow
Stephen Chung, Wenyu Du, Jie Fu
TL;DR
This work addresses inefficiencies in the reasoning that large language models (LLMs) learn through reinforcement learning (RL). It introduces Thinker, a four-stage task that decomposes question answering (QA) into Fast Thinking (intuition under a tight token budget), Verification, Slow Thinking (deliberate refinement), and Summarization (concise integration). Each stage receives its own reward signal, enabling explicit credit assignment and targeted training of distinct cognitive skills; the approach uses PPO and fixed token budgets, with training performed on a large math QA dataset. Empirical results show consistent accuracy gains on multiple benchmarks for two models (Qwen2.5-1.5B and DeepSeek-R1-Qwen-1.5B), along with improved inference efficiency, including a fast mode that uses fewer than 1000 tokens. The authors also report a reduced tendency toward reflection in model outputs and release open-source models and code to support further research on structured reasoning for LLMs.
Abstract
Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
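To make the four-stage structure concrete, the following is a minimal sketch of how one episode might be sequenced. It is not the authors' implementation: the `Generate` callable, the prompt wording, the `Episode` container, and every budget other than the 1000-token ceiling for Fast Thinking are hypothetical, and the sketch simply runs all four stages in order. Per-stage rewards and PPO training are not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical model interface: (prompt, max_new_tokens) -> generated text.
Generate = Callable[[str, int], str]

# Illustrative budgets. Only the Fast Thinking ceiling (1000 tokens) comes
# from the abstract; the other values are placeholders for this sketch.
BUDGETS: Dict[str, int] = {
    "fast": 1000,
    "verify": 2000,
    "slow": 6000,
    "summary": 1000,
}


@dataclass
class Episode:
    fast_answer: str
    verdict: str
    slow_answer: str
    summary: str


def run_four_stages(question: str, generate: Generate) -> Episode:
    """Run Fast Thinking, Verification, Slow Thinking, Summarization in order.

    Assumption: every stage always runs. The abstract does not specify
    whether later stages are conditional on earlier outcomes.
    """
    # Stage 1: answer under a strict token budget.
    fast = generate(f"Question: {question}\nAnswer concisely.", BUDGETS["fast"])

    # Stage 2: the model evaluates its own initial response.
    verdict = generate(
        f"Question: {question}\nProposed answer: {fast}\nIs this answer correct?",
        BUDGETS["verify"],
    )

    # Stage 3: refine the initial response with more deliberation.
    slow = generate(
        f"Question: {question}\nInitial answer: {fast}\n"
        f"Verification: {verdict}\nReconsider carefully and give a final answer.",
        BUDGETS["slow"],
    )

    # Stage 4: distill the refinement into precise steps.
    summary = generate(
        f"Question: {question}\nDetailed solution: {slow}\n"
        "Summarize the solution as concise steps.",
        BUDGETS["summary"],
    )
    return Episode(fast, verdict, slow, summary)


def echo_model(prompt: str, max_new_tokens: int) -> str:
    """Stand-in for an LLM call: returns a tag recording the budget it was given."""
    return f"<{max_new_tokens}>"
```

Calling `run_four_stages("What is 2 + 2?", echo_model)` exercises the control flow without a real model; swapping `echo_model` for a wrapper around an actual LLM is left to the reader.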
