Hindsight Experience Replay Accelerates Proximal Policy Optimization

Douglas C. Crowder, Darrien M. McKenzie, Matthew L. Trappett, Frances S. Chance

TL;DR

Hindsight experience replay (HER) can dramatically accelerate proximal policy optimization (PPO), an on-policy reinforcement learning algorithm, when tested on a custom predator-prey environment.

Abstract

Hindsight experience replay (HER) accelerates off-policy reinforcement learning algorithms for environments that emit sparse rewards by modifying the goal of the episode post-hoc to be some state achieved during the episode. Because post-hoc modification of the observed goal violates the assumptions of on-policy algorithms, HER is not typically applied to on-policy algorithms. Here, we show that HER can dramatically accelerate proximal policy optimization (PPO), an on-policy reinforcement learning algorithm, when tested on a custom predator-prey environment.
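
The post-hoc goal modification described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Transition` record, the 0/1 sparse reward, the grid-style tuple states, and the choice of the final achieved state as the new goal are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

State = Tuple[int, int]  # hypothetical 2-D grid position


@dataclass(frozen=True)
class Transition:
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def sparse_reward(achieved: State, goal: State) -> float:
    # Assumed reward convention: 1 only when the goal is reached, else 0.
    return 1.0 if achieved == goal else 0.0


def relabel_with_final_state(episode: List[Transition]) -> List[Transition]:
    """Return a copy of the episode whose goal is the last state actually reached."""
    if not episode:
        return []
    new_goal = episode[-1].next_state
    return [
        replace(t, goal=new_goal, reward=sparse_reward(t.next_state, new_goal))
        for t in episode
    ]


# A failed episode: the agent never reaches the original goal (3, 3).
episode = [
    Transition((0, 0), 1, (0, 1), (3, 3), 0.0),
    Transition((0, 1), 1, (0, 2), (3, 3), 0.0),
]
relabeled = relabel_with_final_state(episode)
```

After relabeling, the final transition carries a nonzero reward because the substituted goal is a state the agent did reach, which is what turns an uninformative sparse-reward episode into a useful training signal. The relabeled goal differs from the one the policy observed while acting, which is the mismatch with on-policy assumptions that the abstract refers to.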

Paper Structure

This paper contains 17 sections, 10 figures.

Figures (10)

  • Figure 1: Custom predator-prey environment.
  • Figure 2: In general, PPO-HER achieves higher rewards than PPO, SAC, or SAC-HER while also being as sample efficient as those algorithms. Bold lines and shaded regions represent the median and interquartile range, respectively.
  • Figure 3: In general, PPO-HER achieves higher rewards than PPO, SAC, or SAC-HER while also being more clock-time efficient. Bold lines and shaded regions represent the median and interquartile range, respectively.
  • Figure 4: PPO-HER is less sensitive to hyperparameters, including the learning rate. Bold lines and shaded regions represent the median and interquartile range, respectively.
  • Figure 5: Results for Fetch environments. Bold lines and shaded regions represent the median and interquartile range, respectively.
  • ...and 5 more figures