Extremum-Seeking Action Selection for Accelerating Policy Optimization

Ya-Chien Chang, Sicun Gao

TL;DR

This work addresses slow learning in continuous-control RL, which arises when high-entropy Gaussian policies explore systems with unstable dynamics. It introduces Extremum-Seeking Action Selection (ESA), which refines each sampled action $a_t$ in three steps: it forms the perturbed offset $u(t)=v(t)+K\sin(\omega t)$, evaluates $Q(s,a_t+u(t))$, and updates $v(t+1)=v(t)+\alpha \sin(\omega t)\mathbb{H}[Q(s,a_t+u(t))]$, where $\mathbb{H}$ denotes the high-pass filter applied to the Q-values (cf. Figure 3). The refined action $a_t+v(t)$ is then applied to the environment. ESA is designed as a drop-in augmentation for PPO and SAC and is reported to yield faster learning and higher final performance on MuJoCo tasks and quadrotor simulations, with ablations over the perturbation magnitude $K$, the frequency $\omega$, and decay. The results indicate that ESC-based per-sample refinement offers a practical path to data-efficient, robust learning in challenging continuous-control scenarios: it tracks local optima without an explicit model while preserving the base policy gradients.

Abstract

Reinforcement learning for control over continuous spaces typically uses high-entropy stochastic policies, such as Gaussian distributions, for local exploration and estimating policy gradient to optimize performance. Many robotic control problems deal with complex unstable dynamics, where applying actions that are off the feasible control manifolds can quickly lead to undesirable divergence. In such cases, most samples taken from the ambient action space generate low-value trajectories that hardly contribute to policy improvement, resulting in slow or failed learning. We propose to improve action selection in this model-free RL setting by introducing additional adaptive control steps based on Extremum-Seeking Control (ESC). On each action sampled from stochastic policies, we apply sinusoidal perturbations and query for estimated Q-values as the response signal. Based on ESC, we then dynamically improve the sampled actions to be closer to nearby optima before applying them to the environment. Our methods can be easily added in standard policy optimization to improve learning efficiency, which we demonstrate in various control learning environments.
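
The per-sample refinement described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the action is scalar, the high-pass filter is a simple first-order discrete filter, and the names `esa_refine`, `hp_coeff`, and `toy_q` as well as all default values are hypothetical choices made for this toy example.

```python
import math


def esa_refine(q_fn, state, action, K=0.2, omega=1.0, alpha=0.1,
               steps=500, hp_coeff=0.9):
    """Refine one scalar action by extremum seeking on q_fn(state, action).

    Sketch only: the first-order high-pass filter and all default values
    are assumptions for this toy, not settings from the paper.
    """
    v = 0.0          # learned offset v(t) added to the sampled action
    hp = 0.0         # high-pass filter output, stands in for H[Q]
    prev_q = None    # previous raw Q-value, needed by the filter
    for t in range(steps):
        s = math.sin(omega * t)
        u = v + K * s                      # u(t) = v(t) + K sin(omega t)
        q = q_fn(state, action + u)        # response signal Q(s, a_t + u(t))
        if prev_q is not None:
            # Discrete high-pass: removes the slowly varying part of Q.
            hp = hp_coeff * (hp + q - prev_q)
        prev_q = q
        v += alpha * s * hp                # v(t+1) = v(t) + alpha sin(omega t) H[Q]
    return action + v                      # refined action a_t + v


def toy_q(state, action):
    """Hypothetical concave Q-landscape with its maximum at action = 1."""
    return -(action - 1.0) ** 2


if __name__ == "__main__":
    sampled = 0.0
    refined = esa_refine(toy_q, state=None, action=sampled)
    print(f"sampled={sampled:.3f} refined={refined:.3f}")
```

In an actual agent, `q_fn` would be the learned critic and the sampled action would come from the stochastic policy. The large step count here only serves to make convergence visible on the toy landscape; a practical budget per sample would likely be much smaller.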

Paper Structure

This paper contains 7 sections, 1 theorem, 10 equations, 6 figures, and 1 algorithm.

Key Result

Proposition II.1

With appropriate sinusoidal perturbations and the corresponding filters, the estimate $v(t)$ converges exponentially to a local optimum $u^*$ of the objective function $J$ in a neighborhood of $v(0)$.
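
A first-order averaging sketch may help to see why the demodulated update moves toward an optimum. It is a heuristic under stated assumptions, not the proof of the proposition: the input is scalar, $J$ is smooth, the amplitude $K$ is small, the high-pass filter $\mathbb{H}$ is idealized as removing only the slowly varying offset $J(v)$, and the continuous-time adaptation gain $\alpha$ is introduced here for illustration.

$$
\begin{aligned}
J\big(v + K\sin(\omega t)\big) &\approx J(v) + K J'(v)\sin(\omega t), \\
\sin(\omega t)\,\mathbb{H}\big[J(v + K\sin(\omega t))\big] &\approx K J'(v)\sin^2(\omega t) = \tfrac{K}{2} J'(v)\big(1 - \cos(2\omega t)\big), \\
\dot v &\approx \tfrac{\alpha K}{2} J'(v) \quad \text{(averaged over one period)}.
\end{aligned}
$$

On average, $v$ therefore follows gradient ascent on $J$. Near a local maximum $u^*$ with $J''(u^*)<0$, linearizing gives $\dot v \approx -\tfrac{\alpha K}{2}\,|J''(u^*)|\,(v-u^*)$, whose solution decays exponentially. This is consistent with the local exponential convergence stated above.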

Figures (6)

  • Figure 1: Diagram for Extremum-Seeking Action Selection (ESA) in the RL setting. We use Extremum-Seeking Control (ESC) strategies to improve the quality of exploratory actions, which reduces the sampling of low-value trajectories and accelerates policy optimization.
  • Figure 2: Comparison of the optimum-tracking performance of ESC and PG in Example II.2. ESC (blue) exhibits the fastest convergence rate in both the static and the dynamic optimum examples. (a) Convergence speed in tracking a static objective function. (b) Convergence in tracking a time-varying objective function. Trajectories show the convergence towards the optimum over time with varying objective values. The initial point is marked by a circle, and the goal point at time $t=4$ is marked by a star.
  • Figure 3: An illustration of the effect of using high-pass filters on the Q-value landscapes. (a) A Q-value landscape at a state in the inverted pendulum environment, plotted for a fixed policy $\pi_{\theta}$ at an intermediate stage of training. (b) Filtered Q-value landscape from (a).
  • Figure 4: Illustration of how ESA improves the performance of PPO in the quadrotor environment. Both policies were evaluated after the same number of training iterations. We observe that ESA improves the quality of sampled actions and accelerates learning. (a) Quadrotor control environment. (b) Performance comparison in a circular target path task. (c) Performance comparison in tracking a figure-eight target path, where the PPO-trained policy diverges.
  • Figure 5: Performance comparison for all methods. PPO+ESA (blue, first row) and SAC+ESA (blue, second row) demonstrate higher learning efficiency and performance compared to other methods across all tasks. In comparison, adding random parameter noise (orange) leads to better exploration in the early stages of some tasks, but fails to sustain effective exploration throughout the entire training process.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Proposition II.1: Convergence of ESC [escbook]
  • Example II.2