Solving Deep Reinforcement Learning Tasks with Evolution Strategies and Linear Policy Networks

Annie Wong; Jacob de Nobel; Thomas Bäck; Aske Plaat; Anna V. Kononova

TL;DR

This work challenges the notion that state-of-the-art DRL always requires deep networks by showing that Evolution Strategies can effectively optimize linear policy networks across diverse RL tasks. By comparing ES and ARS with gradient-based methods (DQN, PPO, SAC) on classic control, MuJoCo, and Atari RAM benchmarks, the authors demonstrate that ES often yields competitive or superior performance for simpler policies and can be more wall-clock efficient due to parallelization. The results reveal that many benchmarks may be solvable with simpler representations than commonly assumed, though deep methods still outperform in the most complex environments; the findings encourage exploring neuroevolution and linear policies as practical, interpretable alternatives in RL. The work suggests aligning evaluation benchmarks with policy complexity to fairly assess optimization algorithms and highlights the potential of ES for energy-efficient, scalable RL.

Abstract

Although deep reinforcement learning methods can learn effective policies for challenging problems such as Atari games and robotics tasks, algorithms are complex, and training times are often long. This study investigates how Evolution Strategies perform compared to gradient-based deep reinforcement learning methods. We use Evolution Strategies to optimize the weights of a neural network via neuroevolution, performing direct policy search. We benchmark both deep policy networks and networks consisting of a single linear layer from observations to actions for three gradient-based methods, such as Proximal Policy Optimization. These methods are evaluated against three classical Evolution Strategies and Augmented Random Search, which all use linear policy networks. Our results reveal that Evolution Strategies can find effective linear policies for many reinforcement learning benchmark tasks, unlike deep reinforcement learning methods that can only find successful policies using much larger networks, suggesting that current benchmarks are easier to solve than previously assumed. Interestingly, Evolution Strategies also achieve results comparable to gradient-based deep reinforcement learning algorithms for higher-complexity tasks. Furthermore, we find that by directly accessing the memory state of the game, Evolution Strategies can find successful policies in Atari that outperform the policies found by Deep Q-Learning. Evolution Strategies also outperform Augmented Random Search in most benchmarks, demonstrating superior sample efficiency and robustness in training linear policy networks.
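
The direct policy search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a basic (mu, lambda)-ES with a fixed step size rather than the CSA-ES, sep-CMA-ES, or CMA-ES variants studied in the paper, and the point-mass task, the `rollout` and `simple_es` names, and all hyperparameters are assumptions made for illustration. The policy is a single linear map from observation to action, and the ES searches directly over its weights using only episodic returns.

```python
import numpy as np

def rollout(weights, steps=100):
    """Episodic return of a linear policy on a toy point-mass task.

    The environment is hypothetical: the observation is (position, velocity),
    the action is a force clipped to [-1, 1], and the reward penalizes
    distance from the origin.
    """
    W = weights.reshape(1, 2)                 # one action, two observations
    obs = np.array([1.0, 0.0])                # start at position 1, at rest
    total = 0.0
    for _ in range(steps):
        # Linear policy: a single layer from observation to action, no bias.
        action = np.clip(W @ obs, -1.0, 1.0)[0]
        pos, vel = obs
        vel = vel + 0.1 * action
        pos = pos + 0.1 * vel
        obs = np.array([pos, vel])
        total -= pos**2 + 0.1 * vel**2
    return total

def simple_es(n_params=2, lam=16, mu=4, sigma=0.5, generations=30, seed=0):
    """Basic (mu, lambda)-ES with fixed step size (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    for _ in range(generations):
        # Sample lambda candidate weight vectors around the current mean.
        candidates = mean + sigma * rng.standard_normal((lam, n_params))
        returns = np.array([rollout(c) for c in candidates])
        # Keep the mu candidates with the highest return and average them.
        elite = candidates[np.argsort(returns)[-mu:]]
        mean = elite.mean(axis=0)
    return mean

if __name__ == "__main__":
    best = simple_es()
    print("return of zero policy:", rollout(np.zeros(2)))
    print("return of evolved policy:", rollout(best))
```

Because each candidate is evaluated independently, the list comprehension over `candidates` is the part that would be parallelized across workers in practice.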

Paper Structure (19 sections, 5 equations, 4 figures, 12 tables, 1 algorithm)

Introduction
Background and Related Work
Methods
Gradient-Based Algorithms
Deep Q-Learning
Proximal Policy Optimization
Soft Actor-Critic
Evolution Strategies
Augmented Random Search
Network Architecture
Experimental Setup
Classic RL Environments
MuJoCo Simulated Robotics
Atari Learning Environment
Results
...and 4 more sections

Figures (4)

Figure 1: This study investigates how evolution strategies compare to gradient-based reinforcement learning methods in optimizing the weights of linear policies. We use both linear networks and the original DRL architectures to learn policies. We find that ES can learn linear policies for numerous tasks where DRL cannot and, in many instances, even surpasses the performance of the original DRL networks, for example in Swimmer.
Figure 2: Adaptation of the mutation distribution for three different Evolution Strategies for the first ten generations of a two-dimensional quadratic function. Function values are shown with color; darker indicates lower (better). Top row: mutation distribution for CSA-ES; middle row: sep-CMA-ES; bottom row: CMA-ES.
Figure 3: Training curves for the CartPole, LunarLander, Swimmer, HalfCheetah, Boxing, and SpaceInvaders environments, showing episodic return (calculated using 5 test episodes) versus the number of training timesteps. Each curve represents the median of 5 trial runs conducted with different random seeds; the shaded area denotes the standard deviation. The ES solve the classic control environments CartPole and LunarLander almost immediately. ARS takes slightly longer but still outperforms the gradient-based methods. Even for the more difficult Swimmer environment, ES and ARS find a linear policy that outperforms DRL in both timesteps and performance. While SAC outperforms all other methods in HalfCheetah, linear ES outperforms classic PPO. For the Atari environments Boxing and SpaceInvaders, ES is able to learn a linear policy from the RAM input, while linear DQN fails to do so. Only for Boxing does DQN find a successful policy. ARS is able to improve on a policy for Boxing, although it does not perform as well as ES, and it fails to learn a policy for SpaceInvaders.
Figure 4: Training curves for the Classic, MuJoCo, and Atari environments, showing episodic return (calculated using 5 test episodes) versus the number of training timesteps. Each curve represents the median of 5 trial runs conducted with different random seeds; the shaded area denotes the standard deviation. In the classic control environment Acrobot, linear ES and ARS solve the environment within a few timesteps, exceeding the performance of the gradient-based methods. In contrast, classic SAC excels in the Pendulum task and is the only method that achieves the maximum reward of 0. In the BipedalWalker environment, classic PPO achieves the optimal reward of 300, but both linear ES and linear PPO are not far behind. ARS-v2 outperforms ARS-v1 but is still not as effective as the other methods. In the MuJoCo tasks, ES and ARS-v2 match the performance of the original DRL networks in Hopper, while the linear gradient-based methods struggle to learn a good policy. In Ant, Humanoid, and Walker2d, classic SAC emerges as the dominant method. ES and ARS-v2 perform comparably to classic PPO, while linear PPO and linear SAC generally have difficulty finding a good policy. Interestingly, in the Ant environment, linear PPO succeeds in identifying an effective policy and even surpasses the performance of the larger PPO network. In the Atari environments Atlantis, BeamRider, Pong, CrazyClimber, Enduro, Qbert, and Seaquest, linear ES learns effective policies using the game's RAM as policy input. Linear DQN fails to do the same, except in CrazyClimber, where it surpasses ES in performance. ARS-v1 outperforms all other methods in BeamRider and Enduro but does not perform as well in Pong and Qbert.
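
The step-size adaptation that the Figure 2 caption refers to for CSA-ES can be sketched as follows. This is a hedged illustration, not the paper's code: the `csa_es` function name, the log-rank recombination weights, and the constants `c_s`, `d_s`, and `chi_n` follow defaults commonly used in the CMA-ES literature and are assumptions here. The sketch keeps an isotropic mutation distribution and adapts only its global step size `sigma` from the length of an evolution path, which is the mechanism that distinguishes CSA-ES from the covariance-adapting variants in the figure.

```python
import numpy as np

def csa_es(f, x0, sigma0=1.0, generations=10, lam=8, seed=0):
    """Minimize f with an isotropic ES using cumulative step-size adaptation.

    Constants follow common CMA-ES defaults (an assumption, not taken
    from the paper).
    """
    rng = np.random.default_rng(seed)
    n = len(x0)
    mu = lam // 2
    # Log-rank recombination weights, normalized to sum to one.
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    mu_eff = 1.0 / np.sum(w**2)
    c_s = (mu_eff + 2.0) / (n + mu_eff + 5.0)
    d_s = 1.0 + 2.0 * max(0.0, np.sqrt((mu_eff - 1.0) / (n + 1.0)) - 1.0) + c_s
    # Approximate expected length of an n-dimensional standard normal vector.
    chi_n = np.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n**2))

    mean = np.asarray(x0, dtype=float)
    sigma = sigma0
    path = np.zeros(n)
    history = []
    for _ in range(generations):
        z = rng.standard_normal((lam, n))          # isotropic mutations
        x = mean + sigma * z
        order = np.argsort([f(xi) for xi in x])[:mu]   # best mu, minimization
        z_w = w @ z[order]                         # weighted mean mutation
        mean = mean + sigma * z_w
        # Accumulate successive steps in the evolution path.
        path = (1.0 - c_s) * path + np.sqrt(c_s * (2.0 - c_s) * mu_eff) * z_w
        # Grow sigma if the path is longer than expected under random
        # selection, shrink it if shorter.
        sigma *= np.exp((c_s / d_s) * (np.linalg.norm(path) / chi_n - 1.0))
        history.append((mean.copy(), sigma))
    return mean, sigma, history

if __name__ == "__main__":
    quadratic = lambda x: float(np.sum(x**2))
    mean, sigma, history = csa_es(quadratic, [3.0, 3.0])
    for gen, (m, s) in enumerate(history, start=1):
        print(gen, m, s)
```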
