The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations

Matthias Lehmann

TL;DR

A holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations, including a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms.

Abstract

In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax.

Paper Structure

This paper contains 34 sections, 15 theorems, 175 equations, 7 figures, 3 tables, 8 algorithms.

Key Result

Theorem 2.1

(Generalized Policy Iteration) Let $\pi_\text{old}$ be the current policy. Then, Generalized Policy Iteration updates its policy by for all $s \in \mathcal{S}$. Let $\bigl(\pi_n \bigr)^\infty_{n=0}$ be a sequence of policies obtained through Generalized Policy Iteration. Then, this sequence converges to an optimal policy, i.e. and
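The displayed formulas of the theorem are not reproduced above. As an assumption based on the standard textbook formulation, not necessarily the paper's exact notation, the improvement step is usually the greedy update $\pi_\text{new}(s) = \arg\max_{a} Q_{\pi_\text{old}}(s, a)$. The minimal sketch below runs classical policy iteration, a special case of Generalized Policy Iteration in which each policy is evaluated to near convergence before it is improved. The two-state MDP `P`, the discount `GAMMA` and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical 2-state, 2-action MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def q_value(V, s, a):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation: sweep until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Greedy improvement: pick the action with the largest Q-value per state."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Start from the policy that always takes action 0.
policy, V = policy_iteration({0: 0, 1: 0})
```

On this toy problem the loop stops after one improvement step, once the greedy policy no longer changes. In a finite MDP such a fixed point of the greedy update is an optimal policy, which is the kind of convergence the theorem describes.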

Figures (7)

  • Figure 1: Simplified taxonomy of RL algorithms. Subfields of RL we focus on are highlighted in gray.
  • Figure 2: A neural network with hidden layers of sizes 5 and 4 as a directed graph.
  • Figure 3: Illustration of the conservative clipping of PPO's objective function, which is shown as a function of the ratio $r_\theta$ for a single transition depending on whether the advantages are positive (a) or negative (b). Replicated from Schulman et al. (2017).
  • Figure 4: Comparison of rewards per episode during training on several MuJoCo tasks. For each algorithm, we report means and standard deviations of three runs with different random seeds.
  • Figure 5: Comparison of the average KL divergence across policies during training.
  • ...and 2 more figures
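
The clipping described in the caption of Figure 3 can be sketched for a single transition as follows. This is a minimal illustration of the clipped surrogate term from the PPO paper, not the code in the linked repository. The function names and the default `epsilon=0.2` are assumptions made for this example.

```python
def clip(x, lo, hi):
    """Restrict x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective for one transition.

    ratio is the probability ratio r_theta between the new and old policy,
    advantage is the advantage estimate for the transition, and epsilon
    (an assumed default) sets the width of the clipping interval.
    """
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(unclipped, clipped)


# Positive advantage: gains from raising the ratio are capped at 1 + epsilon.
positive_capped = ppo_clipped_term(1.5, 1.0)
# Negative advantage: gains from lowering the ratio are capped at 1 - epsilon.
negative_capped = ppo_clipped_term(0.5, -1.0)
# Moves that make the objective worse are not clipped.
positive_uncapped = ppo_clipped_term(0.5, 1.0)
negative_uncapped = ppo_clipped_term(1.5, -1.0)
```

The two capped cases correspond to the flat parts of panels (a) and (b): once the ratio leaves the interval in the direction that would increase the objective, the term stops growing, so there is no incentive to move the policy further in a single update.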

Theorems & Definitions (30)

  • Theorem 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Theorem 3.1
  • proof
  • Definition 3.2
  • Theorem 4.1
  • Definition 5.1
  • ...and 20 more