Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Abdullah Akgül, Gulcin Baykal, Manuel Haußmann, Melih Kandemir
TL;DR
This work tackles non-stationary dynamics in continuous control by learning a distribution over the value function with evidential deep learning, which both preserves the plasticity of the critic and enables directed exploration. The proposed Evidential Proximal Policy Optimization (EPPO) augments PPO with an evidential value estimator and probabilistic advantages, yielding two exploration variants (EPPO_cor and EPPO_ind) that incorporate uncertainty through a UCB-like objective $\hat{A}_t^{\text{UCB}} = \mathds{E}[\hat{A}_t^{\text{GAE}}] + \kappa \sqrt{\mathrm{var}[\hat{A}_t^{\text{GAE}}]}$, where $\kappa$ scales the uncertainty term. Key contributions include the first application of evidential value learning to on-policy deep reinforcement learning (DRL), a hierarchical Bayesian formulation with four evidential hyperparameters $\boldsymbol{m}=(\omega,\nu,\alpha,\beta)$, and two practical schemes for approximating the variance of generalized advantage estimates (GAEs) to drive directed exploration. Empirical results on structured non-stationary tasks show that the EPPO variants outperform state-of-the-art PPO baselines in both task-specific and overall performance, with improved plasticity and exploration dynamics. The approach holds promise for real-world, co-adaptive systems, such as robotics and autonomous systems, where rapid adaptation to changing dynamics is critical.
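The UCB-like objective above can be illustrated with a minimal sketch, assuming the per-timestep mean and variance of the GAE are already available as arrays. How those moments are computed (the EPPO_cor and EPPO_ind schemes) is not shown here; the function name `ucb_advantage`, the array names, and the value of `kappa` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ucb_advantage(adv_mean, adv_var, kappa=1.0):
    """Combine advantage mean and variance into a UCB-style advantage.

    Computes mean + kappa * sqrt(var) elementwise, mirroring the formula
    in the TL;DR. The inputs are assumed to be precomputed moments of the
    GAE; this sketch does not estimate them.
    """
    adv_mean = np.asarray(adv_mean, dtype=float)
    adv_var = np.asarray(adv_var, dtype=float)
    if np.any(adv_var < 0):
        raise ValueError("variance must be non-negative")
    return adv_mean + kappa * np.sqrt(adv_var)


# Illustrative values only (not taken from the paper).
adv_mean = np.array([0.2, -0.1, 0.5])
adv_var = np.array([0.04, 0.25, 0.0])
adv_ucb = ucb_advantage(adv_mean, adv_var, kappa=1.0)
```

In this toy example the second timestep has a negative mean advantage but a large variance, so its UCB advantage rises to the same level as the first timestep. This is the sense in which the uncertainty term directs exploration toward poorly understood state-action pairs; with `kappa=0` the expression reduces to the mean advantage alone.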
Abstract
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i)~preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm \emph{Evidential Proximal Policy Optimization (EPPO)} due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.
