Hard-Thresholding Meets Evolution Strategies in Reinforcement Learning

Chengqian Gao; William de Vazelhes; Hualin Zhang; Bin Gu; Zhiqiang Xu

TL;DR

The paper tackles the vulnerability of Evolution Strategies to task-irrelevant features by introducing NESHT, which combines Natural Evolution Strategies with a Hard-Thresholding operator to enforce $L_0$ sparsity in policy parameters. It provides a rigorous convergence analysis under boundedness and smoothness assumptions, deriving gradient-estimator error bounds and convergence rates that account for the sparsity constraint. Empirically, NESHT demonstrates robustness to noisy Mujoco environments with sparse rewards and improved performance on pixel-based Atari tasks, outperforming vanilla NES and several RL baselines in many settings. The work offers a principled, scalable approach to feature selection within ES-based reinforcement learning, with practical implications for real-world applications where reward signals are imperfect and observations include substantial irrelevancies.

Abstract

Evolution Strategies (ES) have emerged as a competitive alternative for model-free reinforcement learning, showcasing exemplary performance in tasks like Mujoco and Atari. Notably, they shine in scenarios with imperfect reward functions, making them invaluable for real-world applications where dense reward signals may be elusive. Yet, an inherent assumption in ES, that all input features are task-relevant, poses challenges, especially when confronted with irrelevant features common in real-world problems. This work scrutinizes this limitation, particularly focusing on the Natural Evolution Strategies (NES) variant. We propose NESHT, a novel approach that integrates Hard-Thresholding (HT) with NES to champion sparsity, ensuring only pertinent features are employed. Backed by rigorous analysis and empirical tests, NESHT demonstrates its promise in mitigating the pitfalls of irrelevant features and shines in complex decision-making problems like noisy Mujoco and Atari tasks.
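The combination described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: it assumes an antithetic Gaussian-smoothing gradient estimator for the NES step and applies a top-k hard-thresholding operator after every update. The names `hard_threshold`, `nes_gradient`, `nesht`, and `toy_fitness`, the quadratic fitness, and all hyperparameter values are hypothetical choices made for the example.

```python
import random


def hard_threshold(theta, k):
    """Keep the k largest-magnitude entries of theta and zero out the rest."""
    # Sort indices by decreasing magnitude; ties go to the lower index.
    order = sorted(range(len(theta)), key=lambda i: (-abs(theta[i]), i))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(theta)]


def nes_gradient(fitness, theta, sigma, n_pairs, rng):
    """Antithetic Gaussian-smoothing estimate of the fitness gradient (assumed form)."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
        minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
        scale = (plus - minus) / (2.0 * sigma * n_pairs)
        for i in range(d):
            grad[i] += scale * eps[i]
    return grad


def nesht(fitness, theta0, k, sigma=0.1, lr=0.05, n_pairs=50, steps=300, seed=0):
    """Gradient ascent on the fitness, projecting onto k-sparse vectors each step."""
    rng = random.Random(seed)
    theta = hard_threshold(theta0, k)
    for _ in range(steps):
        g = nes_gradient(fitness, theta, sigma, n_pairs, rng)
        theta = hard_threshold([t + lr * gi for t, gi in zip(theta, g)], k)
    return theta


# Toy fitness (hypothetical): only the first two coordinates matter.
TARGET = [1.0, -1.0, 0.0, 0.0, 0.0, 0.0]


def toy_fitness(theta):
    return -sum((t - s) ** 2 for t, s in zip(theta, TARGET))


theta0 = [0.0] * 6
theta_final = nesht(toy_fitness, theta0, k=2)
print(theta_final, toy_fitness(theta_final))
```

By construction the returned vector has at most `k` nonzero entries, which is the $L_0$ constraint in its simplest form. In a reinforcement learning setting, `toy_fitness` would be replaced by the return of a policy rollout.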

Paper Structure (40 sections, 8 theorems, 51 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Preliminaries
Markov decision process
Decision-Making with Irrelevant Features
The objective function
Fitness score
$L_0$-constraint optimization
Why $L_0$ constraint?
Our proposal: NESHT
NES
Hard-thresholding operator
Compatibility concerns
Convergence Analysis
Assumptions
Smoothness
...and 25 more sections

Key Result

Lemma 1

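Under the boundedness assumption, $F_{\sigma}$ is Lipschitz-smooth (i.e., its gradient is Lipschitz-continuous) with smoothness constant $L = \frac{(d + 1)B}{\sigma^2}$; that is, $L$ satisfies $\|\nabla F_{\sigma}(\theta) - \nabla F_{\sigma}(\theta')\| \le L \|\theta - \theta'\|$ for all $\theta, \theta' \in \mathbb{R}^d$.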
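To get a feel for how the constant in Lemma 1 scales, the formula can be evaluated directly. The sketch below assumes $d$ is the parameter dimension, $B$ is the bound from the boundedness assumption, and $\sigma$ is the smoothing standard deviation; the function name `smoothness_constant` and the numeric values are hypothetical.

```python
def smoothness_constant(d, bound, sigma):
    """Evaluate L = (d + 1) * B / sigma^2 from Lemma 1."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (d + 1) * bound / sigma ** 2


# Halving sigma quadruples L: a smaller smoothing radius gives a less smooth surrogate.
L_wide = smoothness_constant(d=3, bound=1.0, sigma=0.5)
L_narrow = smoothness_constant(d=3, bound=1.0, sigma=0.25)
print(L_wide, L_narrow)
```

The constant grows linearly in the dimension $d$ and in the bound $B$, and inversely with $\sigma^2$.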

Figures (3)

Figure 1: Heatmap illustrating the evolution of learned weights from a NESHT policy (left) and a NES policy (right) over epochs. The environment studied is Hopper-V3, with tenfold Gaussian noise. Among the 11 distinct observation segments (Y-axis), only the first (0-th) segment corresponds to the environment-provided features, while all subsequent 10 segments represent Gaussian noise (task-irrelevant features). The heatmap color indicates the norm of the learned weights. With the HT operator, only the portion of the neuron corresponding to task-relevant features (the 0-th segment) is activated. Without HT, NES struggles with task-irrelevant features, leading to poor performance.
Figure 2: Comparison of NES in four representative Atari tasks with and without HT. We report the outcomes of NESHT under varying hard-thresholding ratios, based on 20 random seeds. Results from the A3C and A2C algorithms are adapted from Salimans et al. (2017).
Figure 3: Ablation study. We assess the impact of the hard-thresholding operation in the presence of Gaussian noise with a 20$\times$ noise dimension on the Mujoco environment. We vary $\beta$ from 0.0 (corresponding to vanilla NES) to 0.95 (retaining only 5% of the neurons).
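The experimental setup in the captions above can be sketched as follows. This is a hedged illustration only: it assumes the task-irrelevant features are standard-normal samples appended to the observation, and that $\beta$ is the fraction of weights zeroed out by hard-thresholding (so $\beta = 0.95$ keeps 5%). The helper names `add_noise_features` and `kept_count`, and the placeholder observation, are hypothetical.

```python
import random


def add_noise_features(obs, noise_factor, rng):
    """Append noise_factor * len(obs) standard-normal features to an observation."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_factor * len(obs))]
    return list(obs) + noise


def kept_count(num_weights, beta):
    """Number of weights kept when a fraction beta is zeroed out (assumed meaning of beta)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return num_weights - int(round(beta * num_weights))


rng = random.Random(0)
obs = [0.0] * 11                              # placeholder observation with 11 entries
noisy_obs = add_noise_features(obs, 10, rng)  # tenfold noise: 11 segments in total
print(len(noisy_obs), kept_count(100, 0.95), kept_count(100, 0.0))
```

With tenfold noise, the original features occupy the first of 11 equally sized segments, matching the layout described for Figure 1.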

Theorems & Definitions (23)

Remark 1
Remark 2
Lemma 1
proof
Lemma 2
proof
Example 1
Theorem 1
proof
Remark 3
...and 13 more
