Reinforcement Learning from Human Feedback

Nathan Lambert

TL;DR

This work surveys reinforcement learning from human feedback (RLHF) as a practical, post-training paradigm for aligning large language models with human preferences. It articulates the canonical three-stage RLHF pipeline (instruction finetuning, reward modeling, and RLHF optimization) and contrasts it with direct alignment algorithms like DPO, highlighting trade-offs in data, computation, and stability. The text catalogs data collection, reward modeling architectures, regularization strategies, and various RL algorithms (PPO, GRPO, REINFORCE) used in practice, while also addressing synthetic data, evaluation, and broader topics like constitutional AI and tool use. Its synthesis emphasizes RLHF as a broad, evolving framework bridging theory and industrial practice, with implications for reasoning models, AI feedback, and product UX. Overall, the work offers a structured, multi-faceted guide to implementing RLHF and understanding its challenges, opportunities, and open questions in real-world AI systems.

Abstract

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool for deploying the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from instruction tuning to training a reward model and finally rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.

Paper Structure

This paper contains 154 sections, 106 equations, 22 figures.

Figures (22)

  • Figure 1: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.
  • Figure 2: Standard RL loop
  • Figure 3: Standard RLHF loop
  • Figure 4: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.
  • Figure 5: A rendition of modern post-training with many rounds.
  • ...and 17 more figures