Evolutionary Policy Optimization

Zelal Su "Lain" Mustafaoglu, Keshav Pingali, Risto Miikkulainen

TL;DR

Experiments show that EPO improves both policy quality and sample efficiency compared to standard PG and EC methods, offering an efficient solution to the exploration-exploitation dilemma in RL.

Abstract

A key challenge in reinforcement learning (RL) is managing the exploration-exploitation trade-off without sacrificing sample efficiency. Policy gradient (PG) methods excel in exploitation through fine-grained, gradient-based optimization but often struggle with exploration due to their focus on local search. In contrast, evolutionary computation (EC) methods excel in global exploration, but lack mechanisms for exploitation. To address these limitations, this paper proposes Evolutionary Policy Optimization (EPO), a hybrid algorithm that integrates neuroevolution with policy gradient methods for policy optimization. EPO leverages the exploration capabilities of EC and the exploitation strengths of PG, offering an efficient solution to the exploration-exploitation dilemma in RL. EPO is evaluated on the Atari Pong and Breakout benchmarks. Experimental results show that EPO improves both policy quality and sample efficiency compared to standard PG and EC methods, making it effective for tasks that require both exploration and local optimization.
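The division of labor described in the abstract, with evolution handling global exploration and gradients handling local exploitation, can be illustrated with a toy sketch. This is not the EPO algorithm itself, whose details are not given in this summary. It is a minimal hybrid loop on an analytic multimodal objective, where plain gradient ascent stands in for a policy gradient update and elitist selection with crossover and Gaussian mutation stands in for neuroevolution. All names (`fitness`, `grad`, `hybrid_search`) and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def fitness(theta):
    # Toy multimodal objective (higher is better); global maximum 0 at theta = 0.
    return -np.sum(theta**2 + 2.0 * (1.0 - np.cos(3.0 * theta)))


def grad(theta):
    # Analytic gradient of the toy objective (stand-in for a policy gradient).
    return -(2.0 * theta + 6.0 * np.sin(3.0 * theta))


def hybrid_search(pop_size=16, dim=4, generations=30, grad_steps=5,
                  lr=0.02, n_elite=4, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-4.0, 4.0, size=(pop_size, dim))
    history = []
    for _ in range(generations):
        # Exploitation: every member takes a few local gradient-ascent steps.
        for _ in range(grad_steps):
            pop = pop + lr * grad(pop)
        # Evaluation and elitist selection.
        scores = np.array([fitness(p) for p in pop])
        order = np.argsort(scores)[::-1]
        elites = pop[order[:n_elite]]
        history.append(float(scores[order[0]]))
        # Exploration: offspring from uniform crossover plus Gaussian mutation.
        children = []
        while len(children) < pop_size - n_elite:
            a, b = elites[rng.integers(n_elite, size=2)]
            mask = rng.random(dim) < 0.5
            children.append(np.where(mask, a, b) + sigma * rng.normal(size=dim))
        pop = np.vstack([elites, np.array(children)])
    return history


if __name__ == "__main__":
    best_per_generation = hybrid_search()
    print(best_per_generation[0], best_per_generation[-1])
```

Because the elites are carried over unchanged and the step size is small relative to the curvature of this objective, the best score in this sketch never decreases from one generation to the next. Mutation is what lets offspring escape the local optima that gradient ascent alone would settle into.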

Paper Structure

This paper contains 22 sections, 10 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Training rewards over 10,000 seconds of wall clock time for PPO and EPO on Pong with shaded areas indicating 95% confidence intervals. EPO demonstrates steady and consistent improvement throughout the training process, with decreasing variance over time, indicating stable convergence. In contrast, PPO shows rapid early learning but stagnates later on, with larger variability in performance.
  • Figure 2: Training rewards over 7,200 seconds of wall clock time for PPO and EPO on Breakout, with shaded areas indicating 95% confidence intervals. EPO demonstrates a consistent upward trajectory in rewards with reduced variance over time. In contrast, PPO exhibits rapid early learning, but higher variance and stagnation in performance.
  • Figure 3: Sample complexity on Breakout across methods after 7,200 seconds of training. EPO and variants are much more sample efficient compared to PPO and pure evolution.
  • Figure 4: Training rewards over 7,200 seconds of wall clock time for PPO, EPO, EPO without pre-training, and Evolution on Breakout, with shaded regions representing 95% confidence intervals. EPO without pre-training does not exceed EPO, but it outperforms PPO. Evolution without gradient-based optimization fails to achieve competitive performance and stagnates.
  • Figure 5: Training reward over 240 seconds across different pre-training steps for Breakout to illustrate the impact of the pre-training duration on training rewards.
  • ...and 1 more figure