REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes
David Ireland, Giovanni Montana
TL;DR
REValueD tackles high-dimensional discrete action spaces by applying value-decomposition to factorisable MDPs (FMDPs) and using an ensemble of critics to mitigate the variance that the decomposition induces. The authors show that while DecQN reduces Q-learning's over-estimation bias, it increases target variance. An ensemble of $K$ critics counters this by reducing the variance by a factor of $K$ without altering the bias, and a regularisation loss addresses cross-dimension credit assignment. Empirically, REValueD outperforms DecQN and BDQ on discretised DM Control Suite tasks, with notable gains in the humanoid and dog tasks, which have many sub-actions. The work offers a practical framework for scalable, stable learning in FMDPs and motivates future exploration of distributional RL and advanced exploration techniques in this setting.
Abstract
Discrete-action reinforcement learning algorithms often falter in tasks with high-dimensional discrete action spaces due to the vast number of possible actions. A recent advancement leverages value-decomposition, a concept from multi-agent reinforcement learning, to tackle this challenge. This study examines the effects of this value-decomposition, revealing that whilst it curtails the over-estimation bias inherent to Q-learning algorithms, it amplifies target variance. To counteract this, we present an ensemble of critics. Moreover, we introduce a regularisation loss that helps to mitigate the effects that exploratory actions in one dimension can have on the value of optimal actions in other dimensions. Our novel algorithm, REValueD, tested on discretised versions of the DeepMind Control Suite tasks, showcases superior performance, especially in the challenging humanoid and dog tasks. We further dissect the factors influencing REValueD's performance, evaluating the significance of the regularisation loss and the scalability of REValueD with increasing sub-actions per dimension.
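To make the value-decomposition and ensemble ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a DecQN-style decomposition in which the joint Q-value is the mean of per-dimension utilities, and an ensemble target in which utilities are averaged across critics before the per-dimension maximum is taken. The function names, array shapes and numbers are hypothetical, chosen only for illustration.

```python
import numpy as np


def decomposed_q(utilities, action):
    """Joint Q-value as the mean of per-dimension utilities (assumed DecQN-style).

    utilities: array of shape (N, n), one row of utilities per action dimension.
    action:    length-N sequence of sub-action indices, one per dimension.
    """
    n_dims = utilities.shape[0]
    return utilities[np.arange(n_dims), action].mean()


def greedy_action(utilities):
    """Greedy joint action: an independent argmax in each dimension.

    This needs N * n utility values rather than n ** N joint Q-values.
    """
    return utilities.argmax(axis=1)


def ensemble_target(reward, discount, next_utilities):
    """Bootstrapped target from an ensemble of K critics (illustrative assumption).

    next_utilities: array of shape (K, N, n) holding each critic's utilities
    at the next state. Utilities are averaged over the K critics, maximised
    per dimension, then averaged over the N dimensions.
    """
    mean_utilities = next_utilities.mean(axis=0)  # shape (N, n)
    return reward + discount * mean_utilities.max(axis=1).mean()


# Toy example: N = 2 action dimensions, n = 3 sub-actions each, K = 2 critics.
utilities = np.array([[1.0, 2.0, 3.0],
                      [0.0, 5.0, 1.0]])
action = greedy_action(utilities)
q_value = decomposed_q(utilities, action)
target = ensemble_target(1.0, 0.9, np.stack([utilities, utilities + 2.0]))
```

Averaging the critics before taking the maximum is the step that lowers the variance of the target in this sketch; the regularisation loss described in the abstract is not modelled here.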
