Harnessing Bounded-Support Evolution Strategies for Policy Refinement

Ethan Hirschowitz, Fabio Ramos

TL;DR

The paper tackles refining competent robotic policies when gradient signals are weak by introducing a two-stage PPO→TD-ES workflow. TD-ES uses bounded-support symmetric triangular perturbations and a centered-rank finite-difference estimator to provide a stable, gradient-free refinement that concentrates exploration near the current policy, yielding a trust-region–like effect without backpropagation. Empirically, across three robotic manipulation tasks, TD-ES achieves higher aggregate success rates than PPO and Gaussian ES, while consistently reducing variance and improving reliability, particularly in precision-demanding tasks. The approach is compute-light, embarrassingly parallel, and shows strong potential for robust policy refinement in robotics and similar domains.

Abstract

Improving competent robot policies with on-policy RL is often hampered by noisy, low-signal gradients. We revisit Evolution Strategies (ES) as a policy-gradient proxy and localize exploration with bounded, antithetic triangular perturbations, which are well suited to policy refinement. We propose Triangular-Distribution ES (TD-ES), which pairs bounded triangular noise with a centered-rank finite-difference estimator to deliver stable, parallelizable, gradient-free updates. In a two-stage pipeline (PPO pretraining followed by TD-ES refinement), this preserves early sample efficiency while enabling robust late-stage gains. Across a suite of robotic manipulation tasks, TD-ES raises success rates by 26.5% relative to PPO and greatly reduces variance, offering a simple, compute-light path to reliable refinement.
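
To make the update described in the abstract concrete, here is a minimal NumPy sketch of one refinement step: antithetic perturbations drawn from a symmetric triangular distribution, fitness values converted to centered ranks, and a finite-difference estimate of the ascent direction. It is an illustration under stated assumptions, not the paper's Algorithm 1. The function names (`centered_ranks`, `td_es_step`), the hyperparameters (`n_pairs`, `half_width`, `lr`), the normalization of the estimator, and the toy quadratic fitness are all hypothetical choices made for this example.

```python
import numpy as np


def centered_ranks(x):
    """Map fitness values to ranks rescaled to [-0.5, 0.5]."""
    x = np.asarray(x, dtype=np.float64)
    ranks = np.empty(x.size, dtype=np.float64)
    ranks[np.argsort(x)] = np.arange(x.size)
    return ranks / (x.size - 1) - 0.5


def td_es_step(theta, fitness_fn, rng, n_pairs=16, half_width=0.1, lr=0.05):
    """One antithetic ES update with bounded triangular noise.

    All hyperparameter defaults are assumptions for illustration only.
    """
    # Symmetric triangular noise on [-1, 1] with mode 0: bounded support,
    # so every candidate stays within half_width of theta per coordinate.
    eps = rng.triangular(-1.0, 0.0, 1.0, size=(n_pairs, theta.size))

    # Antithetic (mirrored) evaluations: theta + h*eps and theta - h*eps.
    f_pos = np.array([fitness_fn(theta + half_width * e) for e in eps])
    f_neg = np.array([fitness_fn(theta - half_width * e) for e in eps])

    # Rank-shape all 2*n_pairs fitness values jointly, then difference pairs.
    shaped = centered_ranks(np.concatenate([f_pos, f_neg]))
    weights = shaped[:n_pairs] - shaped[n_pairs:]

    # Finite-difference estimate of the ascent direction (one possible
    # normalization; the paper's exact scaling is not reproduced here).
    grad = weights @ eps / (2 * n_pairs * half_width)
    return theta + lr * grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([1.0, -1.0])  # toy optimum, stands in for a task reward
    theta = np.zeros(2)             # stands in for a PPO checkpoint
    for _ in range(200):
        theta = td_es_step(theta, lambda p: -np.sum((p - target) ** 2), rng)
    print("distance to optimum:", np.linalg.norm(theta - target))
```

Because the ranks discard the magnitude of the fitness values, the step length in this sketch is governed by `lr` and `half_width` rather than by reward scale. The per-candidate evaluations are independent of one another, which is what makes this style of update straightforward to parallelize.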

Paper Structure

This paper contains 27 sections, 12 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Parameter space exploration comparison between Gaussian ES and TD-ES across generations. Both methods search from a PPO checkpoint toward higher-reward regions. Gaussian ES (top row) uses unbounded perturbations that spread widely, while TD-ES (bottom row) employs bounded triangular perturbations for localized exploration. Orange points indicate candidates in high-reward regions, grey points in lower-reward areas. TD-ES achieves more focused exploration with higher sample efficiency in beneficial regions.
  • Figure 2: Theoretical triangular distribution PDF overlaid with a subset of our actual samples (histogram).
  • Figure 3: Relative reduction in gradient estimator variance achieved by triangular perturbations compared to Gaussian perturbations during ES refinement. The y-axis shows the percentage by which triangular ES reduces variance relative to Gaussian ES at each generation, computed from multiple independent gradient estimates. Positive values indicate lower variance for triangular perturbations. The bounded support of triangular distributions consistently reduces estimator variance throughout the refinement process, demonstrating the stabilizing effect of localized parameter-space exploration.
  • Figure 4: Robotic manipulation tasks: (a) Lift-Cube with 36D observations and 4096 environments, (b) Open-Drawer with 31D observations and 4096 environments, (c) Peg-Insert with 19D observations and 512 environments due to contact modeling demands.
  • Figure 5: Individual run success rates. Our approach shows reduced variance across all tasks, with particularly tight clustering on precision-demanding tasks (Open-Drawer, Peg-Insert).
  • ...and 1 more figure