Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization

Abdullah Akgül, Gulcin Baykal, Manuel Haußmann, Melih Kandemir

TL;DR

This work tackles non-stationary dynamics in continuous control by learning a distribution over the value function with evidential deep learning, enabling both plasticity preservation of the critic and directed exploration. The proposed Evidential Proximal Policy Optimization (EPPO) augments PPO with an evidential value estimator and probabilistic advantages, yielding two exploration variants (EPPO_cor and EPPO_ind) that incorporate uncertainty via a UCB-like objective $\hat{A}_t^{\text{UCB}} = \mathbb{E}[\hat{A}_t^{\text{GAE}}] + \kappa \sqrt{\mathrm{var}[\hat{A}_t^{\text{GAE}}]}$. Key contributions include the first application of evidential value learning to on-policy DRL, a hierarchical Bayesian formulation with four evidential hyperparameters $\boldsymbol{m}=(\omega,\nu,\alpha,\beta)$, and two practical variance-approximation schemes for GAEs to drive directed exploration. Empirical results on structured non-stationary tasks show EPPO variants outperform state-of-the-art PPO baselines in both task-specific and overall performance, with improved plasticity and exploration dynamics. This approach holds promise for real-world, co-adaptive systems where rapid adaptation to changing dynamics is critical, such as robotics and autonomous systems.
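As a rough illustration of the UCB-like objective quoted above, the sketch below combines a per-step advantage mean and variance into $\hat{A}_t^{\text{UCB}}$. It assumes the mean and variance of the GAE are already available as arrays; how EPPO_cor and EPPO_ind approximate that variance is not shown here. The function name `ucb_advantage`, the default `kappa`, and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def ucb_advantage(adv_mean, adv_var, kappa=1.0):
    """Return mean + kappa * sqrt(variance), element-wise.

    adv_mean, adv_var: per-step mean and variance of the GAE
    (assumed to be given; their computation is outside this sketch).
    kappa: exploration weight (hypothetical default).
    """
    adv_mean = np.asarray(adv_mean, dtype=float)
    adv_var = np.asarray(adv_var, dtype=float)
    # Guard against tiny negative variances caused by numerical error.
    adv_std = np.sqrt(np.maximum(adv_var, 0.0))
    return adv_mean + kappa * adv_std

# Toy values, chosen only for illustration.
adv_mean = np.array([0.5, -0.2, 1.0])
adv_var = np.array([0.04, 0.25, 0.0])
ucb = ucb_advantage(adv_mean, adv_var, kappa=2.0)
```

With a larger `kappa`, steps whose advantage estimate is more uncertain receive a larger bonus, which is the sense in which the objective directs exploration; setting `kappa=0` recovers the plain expected advantage.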

Abstract

Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i) preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm Evidential Proximal Policy Optimization (EPPO) due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.

Paper Structure

This paper contains 52 sections, 25 equations, 12 figures, 13 tables, 1 algorithm.

Figures (12)

  • Figure 1: PPO, its non-stationary extension, and PPO equipped with directed exploration all lose their adaptation capability after 1 million steps. In contrast, evidential PPO variants continue to improve, and directed exploration further enhances evidential PPO’s performance. See the appendix on result visualizations for details.
  • Figure 2: Plate diagram of our evidential value learning model.
  • Figure 3: Plasticity preservation analysis using critic network metrics. We evaluate three metrics: effective rank, stable rank, and dormant unit percentage, shown from left to right. The top row shows results from the slippery environments, and the bottom row shows results from the paralysis environments. Each box plot summarizes the distribution of the respective metric across training seeds: the red line indicates the mean, the black line indicates the median, and the individual points represent outliers. These metrics quantify the prediction capacity of the critic networks as learning progresses. EPPO variants consistently preserve plasticity better than PPO variants, as shown by higher ranks and lower dormant unit percentages.
  • Figure 4: $p$-values for effective rank in slippery.
  • Figure 5: Plasticity preservation analysis for the slippery experiment.
  • ...and 7 more figures
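
The three critic metrics named in the Figure 3 caption (effective rank, stable rank, dormant unit percentage) can be sketched as follows. This is a minimal sketch using definitions that are common in the plasticity literature; the paper's exact definitions and thresholds may differ. The function names, the threshold `tau`, and the toy matrices are hypothetical. Inputs are assumed to be non-zero matrices with one row per sample and one column per unit.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Exponential of the entropy of the normalized singular values."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > 0]  # drop exact zeros so log(p) is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def stable_rank(features):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(features, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

def dormant_unit_percentage(activations, tau=0.0):
    """Percentage of units whose normalized mean |activation| is <= tau."""
    score = np.abs(activations).mean(axis=0)      # one score per unit
    normalized = score / (score.mean() + 1e-12)   # relative to layer mean
    return float(100.0 * (normalized <= tau).mean())

# Toy inputs, chosen only for illustration.
full_rank = np.eye(4)                         # four equal singular values
rank_one = np.outer(np.ones(4), np.ones(4))   # a single non-zero singular value
acts = np.array([[1.0, 0.0], [2.0, 0.0]])     # second unit never fires

er_full, er_one = effective_rank(full_rank), effective_rank(rank_one)
sr_full, sr_one = stable_rank(full_rank), stable_rank(rank_one)
dormant = dormant_unit_percentage(acts)
```

Under these definitions, higher ranks mean the critic's features span more directions, and a lower dormant percentage means fewer units have stopped responding, which matches how the caption reads the plots.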