Evolutionary Policy Optimization

Jianren Wang; Yifan Su; Abhinav Gupta; Deepak Pathak

TL;DR

Evolutionary Policy Optimization (EPO) addresses the stagnation of on-policy reinforcement learning in large-scale, data-rich settings by integrating a population-based genetic algorithm with policy gradients. It maintains a population of latent-conditioned agents that share a single actor–critic network, plus a master agent that learns from their aggregated experience via Split-and-Aggregate Policy Gradient (SAPG). Elite agents undergo crossover and mutation to generate diverse yet bounded behaviors, while a hybrid update scheme, stabilized by importance sampling, lets the master learn from both on-policy and off-policy data. Across manipulation, locomotion, and control tasks, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability. Its scaling-law analyses show continued gains with increased parallelism, which makes it well suited to data-rich simulation environments and large-scale RL deployments.
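
As a rough illustration of the crossover-and-mutation step on agent latents, the sketch below runs one genetic update over a matrix of latent vectors. It is not the authors' implementation: the function name `evolve_latents`, uniform crossover, Gaussian mutation, the clipping bound, and all default values are assumptions chosen for illustration.

```python
import numpy as np

def evolve_latents(latents, fitness, num_elites, mutation_std=0.1, bound=1.0, rng=None):
    """One hypothetical genetic step over per-agent latent vectors.

    latents: (pop_size, dim) array, one latent per agent.
    fitness: (pop_size,) array, higher is better.
    Assumed operators: uniform crossover, Gaussian mutation, clipping to [-bound, bound].
    """
    if num_elites < 2:
        raise ValueError("need at least two elites to perform crossover")
    rng = np.random.default_rng() if rng is None else rng
    pop_size, dim = latents.shape

    # Rank agents by fitness and keep the best ones unchanged as elites.
    elite_idx = np.argsort(fitness)[::-1][:num_elites]
    elites = latents[elite_idx]

    children = []
    for _ in range(pop_size - num_elites):
        # Pick two distinct elite parents.
        i, j = rng.choice(num_elites, size=2, replace=False)
        # Uniform crossover: each coordinate comes from one parent or the other.
        mask = rng.random(dim) < 0.5
        child = np.where(mask, elites[i], elites[j])
        # Gaussian mutation, then clipping so the child stays in a bounded region.
        child = np.clip(child + rng.normal(0.0, mutation_std, size=dim), -bound, bound)
        children.append(child)

    if not children:
        return elites.copy()
    return np.vstack([elites, np.array(children)])
```

Calling `evolve_latents(latents, fitness, num_elites=2)` returns a population of the same size in which the first rows are the unchanged elites and the remaining rows are their mutated offspring. Clipping is only one possible way to keep offspring bounded.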

Abstract

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.
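
To make the idea of latent-conditioned agents with shared parameters concrete, here is a minimal sketch in which every agent uses the same network weights and differs only in the latent vector concatenated to its observation. The class name `SharedLatentPolicy`, the two-layer architecture, and the layer sizes are hypothetical and do not come from the paper.

```python
import numpy as np

class SharedLatentPolicy:
    """Hypothetical actor whose weights are shared by all agents in the population."""

    def __init__(self, obs_dim, latent_dim, act_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # A single set of weights serves the whole population.
        self.w1 = rng.normal(0.0, 0.1, size=(obs_dim + latent_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, act_dim))
        self.b2 = np.zeros(act_dim)

    def act_mean(self, obs, latent):
        # Conditioning: the agent's latent is appended to its observation.
        x = np.concatenate([obs, latent], axis=-1)
        h = np.tanh(x @ self.w1 + self.b1)
        return h @ self.w2 + self.b2


policy = SharedLatentPolicy(obs_dim=3, latent_dim=2, act_dim=1)
obs = np.ones((2, 3))                          # the same observation for two agents
latents = np.array([[1.0, 0.0], [0.0, 1.0]])   # a different latent per agent
means = policy.act_mean(obs, latents)          # shape (2, 1), one action mean per agent
```

Because the latents differ, the two agents produce different action means from the same observation, while memory cost stays close to that of a single network.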
