An agent design with goal reaching guarantees for enhancement of learning
Pavel Osinenko, Grigory Yaremenko, Georgiy Malaniya, Anton Bolychev, Alexander Gepperth
TL;DR
This work addresses reinforcement learning in MDPs with a designated goal region $\mathbb{G}$ and a probabilistic goal-reaching property $\Pi_0$. It introduces a critic-based augmentation that preserves the goal-reaching guarantee while accelerating learning: a state-valued critic is updated only when $\hat{V}^w(s_{t+1}) - \hat{V}^{w^\dagger}(s_t^{\dagger}) > 0$, with a relaxation mechanism $P_{relax}$ decaying at rate $\lambda_{relax}$ and bounded critic outputs via $-\hat{\kappa}_{up}(\|s\|) \le \hat{V}^w(s) \le -\hat{\kappa}_{low}(\|s\|)$. Theoretical guarantees (Theorem 1) show the policy sequence $\pi_t$ retains the goal-reaching property with probability at least $1 - \eta$, and experiments on six environments demonstrate faster learning while achieving final performance on par with or better than strong baselines such as PPO, DDPG, SAC, TD3, and REINFORCE. The approach can be layered on top of any critic-based agent, potentially improving sample efficiency without sacrificing the goal-reaching objective. The authors provide a formal proof and empirical evidence, with open-source code at the referenced repository.
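The gating rule summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It encodes only the two inequalities quoted in the summary and makes several assumptions that the summary does not confirm: the bounding functions `kappa_low` and `kappa_up` are taken to be quadratics with arbitrary coefficients, a rejected critic update is assumed to fall back to the basis policy unless a relaxation draw with probability $P_{relax}$ succeeds, and $P_{relax}$ is assumed to decay multiplicatively by $\lambda_{relax}$. All function names are hypothetical.

```python
import random


def kappa_low(r):
    # Hypothetical lower bounding function (assumption: quadratic in the state norm).
    return 0.1 * r * r


def kappa_up(r):
    # Hypothetical upper bounding function (assumption: quadratic in the state norm).
    return 10.0 * r * r


def critic_accepted(v_next, v_anchor, state_norm):
    """Check the two conditions quoted in the summary.

    v_next     : critic value at the new state under candidate weights w
    v_anchor   : critic value at the last accepted state under weights w_dagger
    state_norm : norm of the new state, ||s||
    """
    improves = (v_next - v_anchor) > 0.0
    bounded = -kappa_up(state_norm) <= v_next <= -kappa_low(state_norm)
    return improves and bounded


def choose_action_source(accepted, p_relax, rng):
    """Return "agent" or "base" (assumed fallback logic, see lead-in)."""
    if accepted:
        return "agent"
    # Relaxation: occasionally let the agent act even after a rejection.
    if rng.random() < p_relax:
        return "agent"
    return "base"


def decay(p_relax, lambda_relax):
    # Assumed multiplicative decay of the relaxation probability.
    return p_relax * lambda_relax


if __name__ == "__main__":
    rng = random.Random(0)
    p_relax = 0.5
    for step in range(5):
        accepted = critic_accepted(v_next=-3.0, v_anchor=-2.0, state_norm=1.0)
        source = choose_action_source(accepted, p_relax, rng)
        print(step, accepted, source, round(p_relax, 4))
        p_relax = decay(p_relax, lambda_relax=0.5)
```

Under these assumptions the relaxation probability shrinks geometrically, so the agent is overruled by the basis policy more and more consistently whenever the critic conditions fail. That is the intuition behind keeping the basis policy's goal-reaching property, but the precise conditions and constants are those of the paper's Theorem 1, not of this sketch.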
Abstract
Reinforcement learning is commonly concerned with maximizing accumulated rewards in Markov decision processes. Oftentimes, a certain goal state or a subset of the state space attains maximal reward. In such a case, the environment may be considered solved when the goal is reached. Whereas numerous techniques, learning-based or not, exist for solving environments, doing so optimally is the biggest challenge. For instance, one may choose a reward rate that penalizes the action effort. Reinforcement learning is currently among the most actively developed frameworks for solving environments optimally by maximizing accumulated reward, in other words, returns. Yet tuning agents is a notoriously hard task, as reported in a series of works. Our aim here is to help the agent learn a near-optimal policy efficiently while preserving the goal-reaching property of some basis policy that merely solves the environment. We suggest an algorithm that is fairly flexible and can be used to augment practically any agent as long as it comprises a critic. A formal proof of the goal-reaching property is provided. Comparative experiments on several problems with popular baseline agents provided empirical evidence that learning can indeed be boosted while ensuring the goal-reaching property.
