Evolutionary Reinforcement Learning via Cooperative Coevolution

Chengpeng Hu; Jialin Liu; Xin Yao

TL;DR

This work tackles the scalability of Evolutionary Reinforcement Learning (ERL) in high-dimensional neural networks by introducing CoERL, a cooperative coevolutionary framework that decomposes policy optimization into multiple subproblems and guides updates with partial gradients. It combines a CC loop for subproblem evolution with an off-policy RL loop (SAC) to exploit temporal information, achieving an overall $\mathcal{O}(\lvert\theta\rvert)$ update cost per iteration and improved sample efficiency. Key findings include competitive results across six MuJoCo locomotion tasks, clear ablations showing the value of each core component, and insights into behaviour-space inheritance and coordination strategies. The framework offers a scalable, data-efficient pathway for large-scale ERL, with potential extensions in knowledge-based grouping and explainable decomposition for neural networks.

Abstract

Recently, evolutionary reinforcement learning has attracted much attention in various domains. By maintaining a population of actors, evolutionary reinforcement learning utilises the collected experiences to improve the behaviour policy through efficient exploration. However, the poor scalability of genetic operators limits the efficiency of optimising high-dimensional neural networks. To address this issue, this paper proposes a novel cooperative coevolutionary reinforcement learning (CoERL) algorithm. Inspired by cooperative coevolution, CoERL periodically and adaptively decomposes the policy optimisation problem into multiple subproblems and evolves a population of neural networks for each of the subproblems. Instead of using genetic operators, CoERL directly searches for partial gradients to update the policy. Updating the policy with partial gradients maintains consistency between the behaviour spaces of parents and offspring across generations. The experiences collected by the population are then used to improve the entire policy, which enhances the sampling efficiency. Experiments on six benchmark locomotion tasks demonstrate that CoERL outperforms seven state-of-the-art algorithms and baselines. An ablation study verifies the unique contribution of CoERL's core ingredients.
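The decomposition and partial-gradient idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy `fitness` function stands in for an episode return, the random equal-sized grouping stands in for the adaptive decomposition, the antithetic evolution-strategy estimator in `partial_gradient` is one plausible way to "search for partial gradients", and the off-policy RL loop is omitted entirely.

```python
import random

def fitness(theta):
    # Toy stand-in for an episode return (assumption): higher is better,
    # with the maximum at theta = (1, ..., 1).
    return -sum((t - 1.0) ** 2 for t in theta)

def random_groups(dim, n_groups, rng):
    # Shuffle the parameter indices and deal them into disjoint groups,
    # so every index belongs to exactly one subproblem.
    idx = list(range(dim))
    rng.shuffle(idx)
    return [idx[g::n_groups] for g in range(n_groups)]

def partial_gradient(theta, group, rng, pop_size=20, sigma=0.1):
    # Estimate the gradient only for the coordinates in `group`, using
    # antithetic Gaussian perturbations restricted to that subproblem.
    grad = {i: 0.0 for i in group}
    for _ in range(pop_size):
        noise = {i: rng.gauss(0.0, 1.0) for i in group}
        plus, minus = list(theta), list(theta)
        for i in group:
            plus[i] += sigma * noise[i]
            minus[i] -= sigma * noise[i]
        diff = fitness(plus) - fitness(minus)
        for i in group:
            grad[i] += diff * noise[i] / (2.0 * sigma * pop_size)
    return grad

def cc_cycle(theta, n_groups, rng, lr=0.05):
    # One cooperative-coevolution cycle: each subproblem is optimised in
    # turn, starting from the parameters left by the previous subproblem.
    theta = list(theta)
    update_counts = [0] * len(theta)
    for group in random_groups(len(theta), n_groups, rng):
        grad = partial_gradient(theta, group, rng)
        for i in group:
            theta[i] += lr * grad[i]  # gradient ascent on the fitness
            update_counts[i] += 1
    return theta, update_counts

rng = random.Random(0)
theta0 = [0.0] * 12
theta, counts = cc_cycle(theta0, n_groups=3, rng=rng)
for _ in range(49):
    theta, _ = cc_cycle(theta, n_groups=3, rng=rng)
print(counts)
print(fitness(theta0), fitness(theta))
```

In this sketch, each cycle touches every parameter exactly once, so the per-cycle update work grows linearly with the number of parameters, and each perturbation only moves the coordinates of one subproblem at a time.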

Paper Structure (18 sections, 8 equations, 5 figures, 3 tables, 1 algorithm)

Background
Preliminary: Markov decision process
Cooperative coevolution
Evolutionary reinforcement learning
Neuroevolution
Evolution-guided policy gradient
Cooperative coevolutionary reinforcement learning
Collaboration between cooperative coevolution and reinforcement learning
Partially updating via cooperative coevolution
Leveraging temporal information
Experiments
Settings
Comparison results
Ablation study
Inheriting behaviour space
...and 3 more sections

Figures (5)

Figure 1: Inconsistency between the behaviour spaces of parents and offspring when one-point crossover is applied to partially exchange the parameters of Actor 1 (red) and Actor 2 (green). Subfigures show the feature maps of the behaviour spaces, reduced by t-SNE (van der Maaten and Hinton, 2008).
Figure 2: Diagram of CoERL. The CoERL algorithm begins by decomposing the policy optimisation problem, parameterised by a neural network, into multiple subproblems and searching for partial gradients to update the policy. The collected experiences are then gathered to further refine the policy using MDP-based RL.
Figure 3: An example of partial updating via CC. Three subproblems are highlighted in red, green, and blue, respectively. For each subproblem, the policy inherits the partial gradient from the previous subproblem. Eventually, all parameters are updated once and only once.
Figure 4: Training curves on six locomotion tasks. Each algorithm is trained for 1e6 timesteps with five different random seeds.
Figure 5: Four exclusive subproblems are highlighted in red, green, blue and yellow, respectively. The feature maps present the behaviour spaces, reduced by t-SNE (van der Maaten and Hinton, 2008), of the policy after being optimised in each subproblem.
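
For readers unfamiliar with the operator named in the Figure 1 caption, one-point crossover on flattened parameter vectors can be sketched as follows. The six-parameter actors, the `linear_policy` helper and the example state are hypothetical and serve only to show how the child's action is computed from a spliced parameter vector; real actors are nonlinear networks.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    # Choose a cut point strictly inside the vector, then splice the
    # prefix of parent A onto the suffix of parent B.
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], cut

def linear_policy(theta, state):
    # Hypothetical single-output linear policy, used only to compare the
    # actions produced by the two parents and the child.
    return sum(w * s for w, s in zip(theta, state))

rng = random.Random(1)
actor_1 = [rng.uniform(-1.0, 1.0) for _ in range(6)]
actor_2 = [rng.uniform(-1.0, 1.0) for _ in range(6)]
child, cut = one_point_crossover(actor_1, actor_2, rng)

state = [1.0, -0.5, 0.25, 2.0, -1.0, 0.5]
print(cut)
print(linear_policy(actor_1, state),
      linear_policy(actor_2, state),
      linear_policy(child, state))
```

The child shares parameters with both parents, yet nothing in the operator constrains its action to stay close to either parent's action on a given state.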
