A Tutorial on LLM Reasoning: Relevant Methods behind ChatGPT o1

Jun Wang

TL;DR

This tutorial addresses the challenge of enabling robust multi-step reasoning in LLMs by proposing a native chain-of-thought framework (NativeCoT) grounded in a Markov Decision Process. The author formalizes reasoning as Q → {R} → A with a deterministic state transition and a learnable Process-Reward Model (PRM) that scores reasoning steps. The tutorial describes practical training loops (STaR data collection and GRPO-style policy optimization) and inference-time strategies (beam search, MCTS) for carrying out reasoning at test time. The work synthesizes world-modeling, RL-based fine-tuning, and search-based decoding into a unified approach, and sketches open-source directions to accelerate progress in scalable native reasoning.
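
The Q → {R} → A formulation with a deterministic transition and PRM-scored steps can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `State` class, the `Answer:` termination convention, the hand-written `toy_propose` policy, and the lookup-table `toy_prm` all stand in for a real LLM policy and a trained PRM. The sketch shows a beam search that keeps the reasoning prefixes with the highest cumulative PRM score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    """MDP state: the question plus the reasoning steps produced so far."""
    question: str
    steps: Tuple[str, ...] = ()


def transition(state: State, action: str) -> State:
    """Deterministic transition: append the chosen step to the state."""
    return State(state.question, state.steps + (action,))


def is_final(state: State) -> bool:
    """Assumed convention: a step starting with 'Answer:' ends the episode."""
    return bool(state.steps) and state.steps[-1].startswith("Answer:")


def beam_search(
    question: str,
    propose: Callable[[State], List[str]],   # stand-in for the LLM policy
    prm: Callable[[State, str], float],      # stand-in for the process-reward model
    beam_width: int = 2,
    max_steps: int = 4,
) -> State:
    """Keep the `beam_width` partial trajectories with the highest summed PRM score."""
    beams = [(0.0, State(question))]
    for _ in range(max_steps):
        candidates = []
        for score, state in beams:
            if is_final(state):
                candidates.append((score, state))  # finished beams carry over unchanged
                continue
            for action in propose(state):
                candidates.append((score + prm(state, action), transition(state, action)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(is_final(s) for _, s in beams):
            break
    return beams[0][1]


def toy_propose(state: State) -> List[str]:
    """Hypothetical policy: two candidate first steps, then a forced answer."""
    if not state.steps:
        return ["2 + 3 = 5", "3 * 4 = 12"]
    return ["Answer: 14"] if state.steps[-1] == "3 * 4 = 12" else ["Answer: 20"]


def toy_prm(state: State, action: str) -> float:
    """Hypothetical PRM: rewards the steps that respect operator precedence."""
    return 1.0 if action in {"3 * 4 = 12", "Answer: 14"} else 0.0


best = beam_search("What is 2 + 3 * 4?", toy_propose, toy_prm)
print(best.steps)
```

Summing step scores is only one way to rank beams; taking the minimum or the last step's score are equally plausible choices, and the sketch does not claim any of them is the one used in the paper.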

Abstract

OpenAI o1 has shown that applying reinforcement learning to integrate reasoning steps directly during inference can significantly improve a model's reasoning capabilities. This result is exciting as the field transitions from the conventional autoregressive method of generating answers to a more deliberate approach that models the slow-thinking process through step-by-step reasoning training. Reinforcement learning plays a key role in both the model's training and decoding processes. In this article, we present a comprehensive formulation of reasoning problems and investigate the use of both model-based and model-free approaches to better support this slow-thinking framework.

Paper Structure

This paper contains 15 sections, 27 equations, 6 figures.

Figures (6)

  • Figure 1: Inference-time computation. (a) An autoregressive LLM directly generates an answer (A) by conditioning on the given question (Q). (b) The concept of chain of thought, or step-by-step thinking, involves incorporating intermediate reasoning steps (R) before arriving at the final answer (A). These repeated operations allow for 1) revisiting and revising prior outputs, 2) progressing to subsequent reasoning stages, and 3) exploring multiple reasoning paths or trajectories.
  • Figure 2: An analogy between human cognition and LLMs. Human actions controlled consciously or unconsciously rely on partially distinct brain circuits. (a) Unconscious control in humans is maintained by a few specialised brain regions, such as the anterior insula and the presupplementary motor area (pre-SMA). (b) Voluntary control engages a broader network, activating many regions within the parietal and prefrontal lobes [van2010unconscious]. Unconscious control is typically fast and instinctive, often driven by automatic processes, whereas conscious control tends to involve more deliberate, computational, and in-depth thinking, allowing for careful reflection and thorough analysis.
  • Figure 3: In this MDP formulation, the LLM is tasked with generating reasoning steps and the final answer to a question in a step-by-step manner. The LLM policy operates by generating tokens, which form higher-level reasoning constructs. The states represent the sequence of reasoning steps so far, and actions correspond to the selection of new reasoning steps or the final answer. The LLM policy governs the choice of actions, and the process-reward model (PRM) provides feedback on the quality of reasoning steps and the final answer. By optimising the policy to maximise the reward, the LLM can be guided by the PRM to generate accurate and meaningful reasoning processes.
  • Figure 4: Combining the value function from the PRM with the LLM's policy generation ensures guided and controlled results. During training, the generation produced by the LLM's policy and the evaluation provided by the PRM reinforce each other, leading to continuous self-improvement and refinement of both components.
  • Figure 5: With the PRM, the LLM can perform non-autoregressive reasoning through three approaches: 1) sampling multiple reasoning trajectories, 2) performing a Monte Carlo search over a tree structure of potential reasoning paths, or 3) combining both methods to enhance flexibility and robustness in reasoning.
  • ...and 1 more figure
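
The first approach named in the Figure 5 caption, sampling several reasoning trajectories and letting the PRM choose among them, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_policy` and `toy_prm` are hypothetical stand-ins for an LLM and a trained PRM, the `Answer:` prefix as a stop signal is an assumed convention, and scoring a trajectory by its weakest step is one aggregation choice among several.

```python
import random
from typing import Callable, List

Policy = Callable[[str, List[str], random.Random], str]
PRM = Callable[[str, List[str], str], float]


def sample_trajectory(policy: Policy, question: str, rng: random.Random,
                      max_steps: int = 3) -> List[str]:
    """Roll out one reasoning trajectory until an 'Answer:' step or the step limit."""
    steps: List[str] = []
    for _ in range(max_steps):
        step = policy(question, steps, rng)
        steps.append(step)
        if step.startswith("Answer:"):
            break
    return steps


def trajectory_score(prm: PRM, question: str, steps: List[str]) -> float:
    """Assumed aggregation: a trajectory is only as good as its weakest step."""
    return min(prm(question, steps[:i], steps[i]) for i in range(len(steps)))


def best_of_n(policy: Policy, prm: PRM, question: str, n: int, seed: int = 0) -> List[str]:
    """Sample n trajectories and return the one the PRM scores highest."""
    rng = random.Random(seed)
    trajectories = [sample_trajectory(policy, question, rng) for _ in range(n)]
    return max(trajectories, key=lambda t: trajectory_score(prm, question, t))


def toy_policy(question: str, steps: List[str], rng: random.Random) -> str:
    """Hypothetical policy: picks the first step at random, then commits to an answer."""
    if not steps:
        return "3 * 4 = 12" if rng.random() < 0.5 else "2 + 3 = 5"
    return "Answer: 14" if steps[0] == "3 * 4 = 12" else "Answer: 20"


def toy_prm(question: str, prefix: List[str], step: str) -> float:
    """Hypothetical PRM: a lookup table marking the precedence-respecting steps as good."""
    return 1.0 if step in {"3 * 4 = 12", "Answer: 14"} else 0.0


chosen = best_of_n(toy_policy, toy_prm, "What is 2 + 3 * 4?", n=16)
print(chosen)
```

Unlike a tree search, this procedure spends its whole budget on complete rollouts and uses the PRM only at the end, which is simple to parallelise but cannot redirect a trajectory that goes wrong early.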

Theorems & Definitions (2)

  • Definition 1: World Model of LLM
  • Definition 2: Native Chain-of-Thought