The Definitive Guide to Policy Gradients in Deep Reinforcement Learning: Theory, Algorithms and Implementations

Matthias Lehmann

TL;DR

A holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations, including a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms.

Abstract

In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate the understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax.

Paper Structure (34 sections, 15 theorems, 175 equations, 7 figures, 3 tables, 8 algorithms)

Introduction
Preliminaries
Notation
Reinforcement Learning
Problem Setting
Value Functions
On-Policy Policy Gradient Methods
Deep Learning
Theoretical Foundations of Policy Gradients
Policy Gradient Theorem
Value Function Estimation with Baselines
Importance Sampling
Policy Gradient Algorithms
REINFORCE
A3C
...and 19 more sections

Key Result

Theorem 2.1

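(Generalized Policy Iteration) Let $\pi_\text{old}$ be the current policy. Generalized Policy Iteration then updates its policy by $\pi_\text{new}(s) = \arg\max_{a \in \mathcal{A}} Q_{\pi_\text{old}}(s, a)$ for all $s \in \mathcal{S}$. Let $\bigl(\pi_n \bigr)^\infty_{n=0}$ be a sequence of policies obtained through Generalized Policy Iteration. Then this sequence converges to an optimal policy, i.e. $\lim_{n \to \infty} V_{\pi_n}(s) = V^*(s)$ and $\lim_{n \to \infty} Q_{\pi_n}(s, a) = Q^*(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.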
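
As a rough illustration of the idea behind this result, the following sketch runs exact tabular policy iteration (a special case of Generalized Policy Iteration) on a tiny deterministic MDP. The MDP itself, the discount factor and all names (`NEXT_STATE`, `REWARD`, `evaluate_policy`, `greedy_improvement`, `policy_iteration`) are assumptions made up for this example and do not come from the paper or its repository.

```python
# Hypothetical 2-state, 2-action deterministic MDP (illustration only).
GAMMA = 0.9                            # assumed discount factor
NEXT_STATE = [[0, 1], [0, 1]]          # NEXT_STATE[s][a] -> next state
REWARD = [[0.0, 1.0], [0.0, 2.0]]      # REWARD[s][a] -> immediate reward
N_STATES, N_ACTIONS = 2, 2


def evaluate_policy(policy, n_sweeps=500):
    """Approximate V_pi by repeated synchronous Bellman backups."""
    v = [0.0] * N_STATES
    for _ in range(n_sweeps):
        v = [
            REWARD[s][policy[s]] + GAMMA * v[NEXT_STATE[s][policy[s]]]
            for s in range(N_STATES)
        ]
    return v


def greedy_improvement(policy):
    """Return the policy that is greedy with respect to Q_pi."""
    v = evaluate_policy(policy)
    q = [
        [REWARD[s][a] + GAMMA * v[NEXT_STATE[s][a]] for a in range(N_ACTIONS)]
        for s in range(N_STATES)
    ]
    return [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        new_policy = greedy_improvement(policy)
        if new_policy == policy:
            return policy
        policy = new_policy


optimal_policy = policy_iteration([0, 0])
optimal_values = evaluate_policy(optimal_policy)
print(optimal_policy, optimal_values)
```

Starting from the policy that always picks action 0, the greedy step switches both states to action 1 after one iteration, and the next iteration leaves the policy unchanged. In this finite toy setting the sequence of policies therefore stabilizes after finitely many steps, which is the simplest instance of the convergence the theorem describes.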

Figures (7)

Figure 1: Simplified taxonomy of RL algorithms. Subfields of RL we focus on are highlighted in gray.
Figure 2: A neural network with hidden layers of sizes 5 and 4 as a directed graph.
Figure 3: Illustration of the conservative clipping of PPO's objective function, which is shown as a function of the ratio $r_\theta$ for a single transition depending on whether the advantages are positive (a) or negative (b). Replicated from schulman2017proximal.
Figure 4: Comparison of rewards per episode during training on several MuJoCo tasks. For each algorithm, we report means and standard deviations of three runs with different random seeds.
Figure 5: Comparison of the average KL divergence across policies during training.
...and 2 more figures

Theorems & Definitions (30)

Theorem 2.1
Definition 2.2
Definition 2.3
Definition 2.4
Definition 2.5
Theorem 3.1
proof
Definition 3.2
Theorem 4.1
Definition 5.1
...and 20 more
