Evolutionary Policy Optimization

Jianren Wang, Yifan Su, Abhinav Gupta, Deepak Pathak

TL;DR

Evolutionary Policy Optimization (EPO) addresses the stagnation of on-policy reinforcement learning in large-scale, data-rich settings by integrating a population-based genetic algorithm with policy gradients. It maintains a population of latent-conditioned agents that share a single actor–critic network, plus a master agent that learns from aggregated experiences via Split-and-Aggregate Policy Gradient (SAPG). Elites undergo crossover and mutation to generate diverse yet bounded behaviors, while a hybrid update scheme enables the master to learn from both on-policy and off-policy data, stabilized by importance sampling. Across manipulation, locomotion, and control tasks, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability, and its scaling law analyses show continued gains with increased parallelism, making it well-suited for data-rich simulation environments and large-scale RL deployments.
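
The elite crossover and mutation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not specify the selection rule, crossover operator, mutation noise, or how behaviors are kept bounded, so top-fraction elitism, uniform crossover, Gaussian mutation, and clipping to a fixed range are all assumptions, as are the names `evolve_latents`, `elite_frac`, `sigma`, and `bound`.

```python
import random

def evolve_latents(latents, fitness, elite_frac=0.25, sigma=0.1, bound=1.0, seed=None):
    """One hypothetical genetic step over per-agent latent 'genes'.

    latents: list of K latent vectors (lists of floats), one per agent.
    fitness: list of K scalar scores (higher is better).
    Returns K latent vectors: the elites first, then their offspring.
    """
    rng = random.Random(seed)
    k = len(latents)
    # Assumption: keep a fixed top fraction of agents as elites (at least two parents).
    n_elite = min(k, max(2, round(elite_frac * k)))
    order = sorted(range(k), key=lambda i: fitness[i], reverse=True)
    elites = [list(latents[i]) for i in order[:n_elite]]

    children = []
    for _ in range(k - n_elite):
        parent_a, parent_b = rng.sample(elites, 2)
        # Assumption: uniform crossover, each gene copied from either parent.
        child = [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
        # Assumption: Gaussian mutation, then clipping so latents stay bounded.
        child = [min(bound, max(-bound, g + rng.gauss(0.0, sigma))) for g in child]
        children.append(child)
    return elites + children
```

Because only the latent vectors change here, the shared actor-critic weights are untouched by the genetic step; in this sketch, diversity comes entirely from the genes each agent is conditioned on.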

Abstract

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.
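
When the master agent learns from experience gathered by other agents, that data is off-policy with respect to the master, and the TL;DR notes that importance sampling stabilizes this. A minimal sketch of one common way to do so is shown below. The PPO-style clipping, the function name `clipped_is_surrogate`, and the `clip_eps` value are assumptions for illustration; the summary only states that importance sampling is used.

```python
import math

def clipped_is_surrogate(logp_master, logp_behavior, advantages, clip_eps=0.2):
    """Importance-weighted surrogate objective with PPO-style clipping (assumed form).

    logp_master:   log-probabilities of the taken actions under the master policy.
    logp_behavior: log-probabilities of the same actions under the agent that
                   collected the data (the behavior policy).
    advantages:    advantage estimates for those actions.
    Returns the mean clipped surrogate, to be maximized.
    """
    total = 0.0
    for lm, lb, adv in zip(logp_master, logp_behavior, advantages):
        # Importance ratio pi_master(a|s) / pi_behavior(a|s), computed in log space.
        ratio = math.exp(lm - lb)
        # Clipping limits how far one off-policy sample can push the update.
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For data the master collected itself, the two log-probabilities coincide at collection time, the ratio is 1, and the objective reduces to the plain mean advantage.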

Paper Structure

This paper contains 23 sections, 8 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Evolutionary Policy Optimization (EPO) integrates genetic algorithms with policy gradients. A population of agents, represented by latent genes and sharing one actor-critic network, interacts with the environment. A master agent is trained on aggregated experiences from all agents to improve efficiency and stability.
  • Figure 2: Evolutionary Policy Optimization
  • Figure 3: We evaluate our algorithm on eight challenging environments that span a diverse set of tasks, including manipulation [petrenko2023dexpbt], locomotion [cheng2024extreme], and classic control benchmarks [tassa2018deepmind].
  • Figure 4: Performance curves of EPO compared to SAC, PQL, PPO, SAPG, PBT, and CEM-RL baselines. For SAPG and PBT, we plot the best-performing numbers of agents (64 and 8, respectively). EPO shows superior sample efficiency and higher asymptotic performance, particularly on challenging tasks.
  • Figure 5: Ablation and scaling results for EPO. (a) Ablation over the number of agents ($K=8,16,32,64$) with a fixed total number of environments ($N=24{,}576$). As the number of agents increases, EPO's performance improves. (b) Training and scaling curves as the number of environments increases. With a fixed number of environments per agent (384), performance improves as the total number of environments scales.
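
The environment budgets in the Figure 5 caption can be checked with a little arithmetic. The sketch below assumes the $N$ environments are split evenly across the $K$ agents, which the caption does not state outright but which matches the 384 environments per agent quoted for panel (b); the helper name `envs_per_agent` is hypothetical.

```python
def envs_per_agent(total_envs, num_agents):
    """Environments per agent under an (assumed) even split of the budget."""
    if total_envs % num_agents != 0:
        raise ValueError("total_envs must divide evenly across agents")
    return total_envs // num_agents

# Panel (a): fixed total N = 24,576, varying the number of agents K.
N = 24576
split = {k: envs_per_agent(N, k) for k in (8, 16, 32, 64)}
# split == {8: 3072, 16: 1536, 32: 768, 64: 384}

# Panel (b): fixed 384 environments per agent, so the total grows linearly with K.
totals = {k: 384 * k for k in (8, 16, 32, 64)}
# totals == {8: 3072, 16: 6144, 32: 12288, 64: 24576}
```

Under this reading, the $K=64$ setting of panel (a) and the largest setting of panel (b) describe the same configuration: 64 agents with 384 environments each.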