A Tutorial on LLM Reasoning: Relevant Methods behind ChatGPT o1
Jun Wang
TL;DR
This tutorial addresses the challenge of enabling robust multi-step reasoning in LLMs by proposing a native chain-of-thought framework (NativeCoT) grounded in a Markov Decision Process. It formalizes reasoning as Q → {R} → A (question, intermediate reasoning steps, answer), with a deterministic state transition and a learnable Process-Reward Model (PRM) that scores individual reasoning steps. It describes practical training loops (STaR data collection and GRPO-style policy optimization) and inference-time strategies (beam search, MCTS) for carrying out reasoning at test time. The work combines world modeling, RL-based fine-tuning, and search-based decoding into a unified approach, and sketches open-source directions for accelerating progress in scalable native reasoning.
Abstract
OpenAI o1 has shown that applying reinforcement learning to integrate reasoning steps directly during inference can significantly improve a model's reasoning capabilities. This result is exciting as the field transitions from the conventional autoregressive method of generating answers to a more deliberate approach that models the slow-thinking process through step-by-step reasoning training. Reinforcement learning plays a key role in both the model's training and decoding processes. In this article, we present a comprehensive formulation of reasoning problems and investigate the use of both model-based and model-free approaches to better support this slow-thinking framework.
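To make the formulation summarized above more concrete, the following is a minimal sketch of reasoning as a search over states, assuming a state is simply the question followed by the reasoning steps taken so far. The names `transition`, `beam_search`, `toy_propose`, and `toy_prm` are hypothetical and do not come from the tutorial; the toy scorer is a hand-written stand-in for a learned Process-Reward Model, and the toy proposer stands in for an LLM policy.

```python
from typing import Callable, List, Tuple

# Assumption: a state is the question followed by the reasoning steps so far.
State = Tuple[str, ...]


def transition(state: State, step: str) -> State:
    """Deterministic transition: the next state is the old state plus the new step."""
    return state + (step,)


def beam_search(
    question: str,
    propose: Callable[[State], List[str]],
    score: Callable[[State], float],
    beam_width: int = 2,
    depth: int = 2,
) -> State:
    """Keep the `beam_width` highest-scoring partial reasoning chains at each depth."""
    beams: List[State] = [(question,)]
    for _ in range(depth):
        candidates = [transition(s, a) for s in beams for a in propose(s)]
        if not candidates:
            break
        # Python's sort is stable, so ties keep their generation order.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_propose(state: State) -> List[str]:
    """Hypothetical stand-in for an LLM policy: always offers the same three steps."""
    return ["+1", "+2", "+3"]


def toy_prm(state: State) -> float:
    """Hypothetical stand-in for a PRM: higher when the running sum is closer to 5."""
    total = sum(int(step) for step in state[1:])
    return -abs(5 - total)


best = beam_search("reach 5", toy_propose, toy_prm, beam_width=2, depth=2)
print(best)
```

In this toy setting the search keeps the two best one-step chains and then extends them, returning a two-step chain whose steps sum to 5. A real system would replace `toy_propose` with sampled model outputs and `toy_prm` with a trained step-level scorer.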
