Reinforcement Learning: An Overview

Kevin Murphy

TL;DR

This survey synthesizes the major strands of reinforcement learning, outlining how value-based, policy-based, and model-based approaches address the core problem of learning controllers for sequential decision making. It highlights the interplay between planning, learning from data, and dealing with partial observability and model uncertainty, while detailing key algorithms (e.g., Q-learning, DQN, PPO, SAC, MCTS, MuZero, Dreamer) and their theoretical underpinnings. A central theme is improving sample efficiency through world models and offline or off-policy methods, with extensive discussion of practical considerations, reward design, and experimental best practices. The work connects RL to inference, control theory, and game-theoretic frameworks for model-based RL (MB-RL), and surveys both foundational ideas and cutting-edge variants that enable scalable, robust decision making in complex environments.

Abstract

This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based methods, policy-based methods, model-based methods, multi-agent RL, LLMs and RL, and various other topics (e.g., offline RL, hierarchical RL, intrinsic reward). It also includes some code snippets for training LLMs with RL.

Paper Structure

This paper contains 388 sections, 469 equations, 53 figures, 8 tables, and 21 algorithms.

Figures (53)

  • Figure 1: A small agent interacting with a big external world. The observation $o_t$ (which, for notational simplicity, includes the previous action $a_t$) is used to update the internal agent state $z_t$, which is passed to the policy $\pi$, which picks the next action $a_{t+1}$ based on the agent's goal $g_t$. Rewards are computed internally by the agent, by comparing $z_t$ with its internal goal $g_t$. The observations, actions, and rewards are stored in a replay buffer, which can be used to learn the policy, a value function (not shown), and optionally an internal world model (for use in model-based RL; see the model-based RL section).
  • Figure 2: Detailed illustration of the interaction of an agent in an environment. The agent has internal state $z_t$, and chooses action $a_t$ based on its policy $\pi_t$ using $a_t \sim \pi_t(z_t|\theta_t)$. It then predicts its next internal states, $z_{t+1|t}$, via the predict function $P$, and optionally predicts the resulting observation, $\hat{o}_{t+1}$, via the observation decoder $D$. The environment has (hidden) internal state $w_t$, which gets updated by the environment model $M$ to give the new state $w_{t+1} \sim M(w_t,a_t)$ in response to the agent's action. The environment also emits an observation $o_{t+1}$ via its observation model, $o_{t+1} \sim O(w_{t+1})$. This gets encoded to $e_{t+1}=E(o_{t+1})$ by the agent's observation encoder $E$, which the agent uses to update its internal state using $z_{t+1}=U(z_t,a_t,e_{t+1})$. The policy is parameterized by $\theta_t$, and these parameters may be updated (at a slower time scale) by an RL algorithm denoted by $\mathcal{A}$. Square nodes are functions, circles are variables (either random or deterministic), dashed square nodes are stochastic functions that take an extra source of randomness (not shown).
  • Figure 3: Illustration of an MDP as a finite state machine (FSM). The MDP has three discrete states (green circles), two discrete actions (orange circles), and two non-zero rewards (orange arrows). The numbers on the black edges represent state transition probabilities, e.g., $p(s'=s_0|a=a_0,s=s_1)=0.7$; most state transitions are impossible (probability 0), so the graph is sparse. The numbers on the yellow wiggly edges represent expected rewards, e.g., $R(s=s_1, a=a_0, s'=s_0) = +5$; state transitions with zero reward are not annotated. From https://en.wikipedia.org/wiki/Markov_decision_process. Used with kind permission of Wikipedia author waldoalvarez.
  • Figure 4: Illustration of sequential belief updating for a two-armed beta-Bernoulli bandit. The prior for the reward for action 1 is the (blue) uniform distribution $\mathrm{Beta}(1,1)$; the prior for the reward for action 2 is the (orange) unimodal distribution $\mathrm{Beta}(2,2)$. We update the parameters of the belief state based on the chosen action, and based on whether the observed reward is success (1) or failure (0).
  • Figure 5: Illustration of how the MineCLIP reward function can be used to help train an agent to play Minecraft in the MineDojo simulator. From Figure 4 of MineDojo. Used with kind permission of Jim Fan.
  • ...and 48 more figures
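
The sequential belief updating described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the `BetaBelief` class, the `update_beliefs` helper, and the particular sequence of actions and rewards are all hypothetical. It relies only on the standard conjugate update for a beta-Bernoulli model, in which a success increments the first parameter and a failure increments the second, and only the arm that was pulled is updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Beta(alpha, beta) belief over one arm's success probability (hypothetical helper)."""
    alpha: float  # pseudo-count of successes
    beta: float   # pseudo-count of failures

    def update(self, reward: int) -> "BetaBelief":
        # Conjugate update: reward 1 increments alpha, reward 0 increments beta.
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        return BetaBelief(self.alpha + reward, self.beta + (1 - reward))

    @property
    def mean(self) -> float:
        # Posterior mean of the success probability.
        return self.alpha / (self.alpha + self.beta)


def update_beliefs(beliefs, action, reward):
    """Return a new belief state in which only the chosen arm is updated."""
    new_beliefs = list(beliefs)
    new_beliefs[action] = new_beliefs[action].update(reward)
    return new_beliefs


# Priors from the caption: Beta(1, 1) for action 1 and Beta(2, 2) for action 2
# (stored here at indices 0 and 1).
beliefs = [BetaBelief(1, 1), BetaBelief(2, 2)]

# Assumed interaction sequence, chosen only for illustration.
beliefs = update_beliefs(beliefs, action=0, reward=1)  # arm 1 succeeds
beliefs = update_beliefs(beliefs, action=1, reward=0)  # arm 2 fails

for i, b in enumerate(beliefs, start=1):
    print(f"action {i}: Beta({b.alpha}, {b.beta}), mean = {b.mean:.3f}")
```

Under this assumed sequence, the belief for action 1 moves from Beta(1, 1) to Beta(2, 1) and the belief for action 2 moves from Beta(2, 2) to Beta(2, 3). Keeping the belief state immutable makes each step of the update easy to inspect, at the cost of allocating a new list per step.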