REValueD: Regularised Ensemble Value-Decomposition for Factorisable Markov Decision Processes

David Ireland, Giovanni Montana

TL;DR

REValueD tackles high-dimensional discrete action spaces by applying value-decomposition to factorisable MDPs and mitigating the induced variance with an ensemble of critics. The authors show that while DecQN reduces Q-learning's over-estimation bias, it increases target variance, which is countered by an ensemble that reduces variance by a factor of $K$ without altering the bias, and a regularisation loss that addresses cross-dimension credit assignment. Empirically, REValueD outperforms DecQN and BDQ on discretised DM Control Suite tasks, with notable gains in humanoid and dog tasks that have many sub-actions. The work offers a practical, robust framework for scalable, stable learning in FMDPs and motivates future exploration of distributional RL and advanced exploration techniques in this setting.
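The variance claim can be illustrated with a short worked equation. This is a sketch under simplifying assumptions that are not taken from the paper: the target errors $\epsilon_1, \dots, \epsilon_K$ of the $K$ ensemble members are treated as independent, with a common mean $\mu$ and common variance $\sigma^2$. The paper's formal statement may rest on different conditions.

```latex
\begin{align}
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\epsilon_k\right]
  &= \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}[\epsilon_k] = \mu, \\
\operatorname{Var}\left[\frac{1}{K}\sum_{k=1}^{K}\epsilon_k\right]
  &= \frac{1}{K^2}\sum_{k=1}^{K}\operatorname{Var}[\epsilon_k] = \frac{\sigma^2}{K}.
\end{align}
```

Under these assumptions, averaging leaves the expected error (the bias) at $\mu$ while dividing the variance by $K$.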

Abstract

Discrete-action reinforcement learning algorithms often falter in tasks with high-dimensional discrete action spaces due to the vast number of possible actions. A recent advancement leverages value-decomposition, a concept from multi-agent reinforcement learning, to tackle this challenge. This study delves deep into the effects of this value-decomposition, revealing that whilst it curtails the over-estimation bias inherent to Q-learning algorithms, it amplifies target variance. To counteract this, we present an ensemble of critics to mitigate target variance. Moreover, we introduce a regularisation loss that helps to mitigate the effects that exploratory actions in one dimension can have on the value of optimal actions in other dimensions. Our novel algorithm, REValueD, tested on discretised versions of the DeepMind Control Suite tasks, showcases superior performance, especially in the challenging humanoid and dog tasks. We further dissect the factors influencing REValueD's performance, evaluating the significance of the regularisation loss and the scalability of REValueD with increasing sub-actions per dimension.
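A minimal sketch of the ideas in the abstract, assuming the mean form of value-decomposition (the joint value is the mean of per-dimension utilities) and an ensemble whose averaged utilities are used to bootstrap the target. The function names, array shapes and the `done` flag are hypothetical choices for illustration, not the authors' implementation, and the regularisation loss is not shown.

```python
import numpy as np


def decomposed_q(utilities, action):
    """Joint value as the mean of per-dimension utilities.

    utilities: array of shape (N, n), one row of n sub-action utilities
               for each of the N action dimensions (hypothetical layout).
    action:    length-N sequence of sub-action indices.
    """
    utilities = np.asarray(utilities, dtype=float)
    dims = np.arange(utilities.shape[0])
    return float(utilities[dims, action].mean())


def greedy_action(utilities):
    """Greedy joint action: an independent argmax in each dimension."""
    return np.asarray(utilities, dtype=float).argmax(axis=1)


def ensemble_target(reward, gamma, next_utilities, done=False):
    """Bootstrapped target computed from the ensemble-mean utilities.

    next_utilities: array of shape (K, N, n) holding the next-state
                    utilities of K critics (assumed layout).
    """
    # Average over the K ensemble members first, giving shape (N, n).
    mean_u = np.asarray(next_utilities, dtype=float).mean(axis=0)
    # Max over sub-actions in each dimension, then mean over dimensions.
    bootstrap = mean_u.max(axis=1).mean()
    return float(reward + gamma * (1.0 - float(done)) * bootstrap)
```

For example, with `K = 2` critics whose next-state utilities are `[[1, 3], [2, 0]]` and `[[3, 1], [0, 2]]`, the averaged utilities are `[[2, 2], [1, 1]]`, so `ensemble_target(1.0, 0.5, ...)` evaluates to `1 + 0.5 * 1.5 = 1.75`. Averaging before the max gives a smaller bootstrap value here than taking the max within each critic would.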

Paper Structure (18 sections, 8 theorems, 22 equations, 13 figures, 6 tables)

Key Result

Theorem 1

Given the definitions of $Z_s^{dqn}$ and $Z_s^{dec}$ (the DQN and DecQN target differences, respectively), we have that:

Figures (13)

  • Figure 1: Performance for the Discretised DeepMind Control Suite tasks. We compare REValueD with DecQN and BDQ. The solid line corresponds to the mean of 10 seeds, with the shaded area corresponding to a 95% confidence interval.
  • Figure 2: Here we assess how the performance of DecQN and REValueD is affected by increasing the size of each sub-action space. We conduct experiments on the fish-swim, cheetah-run and dog-walk tasks. $n$ corresponds to the size of the sub-action spaces, i.e. $|\mathcal{A}_i| = n$ for all $i$. The solid line corresponds to the mean of 10 seeds, with the shaded area corresponding to a 95% confidence interval. Further results are given in the appendix (further bin size plots).
  • Figure 3: Stochastic environment tasks. In the top row we added Gaussian white noise $(\sigma = 0.1)$ to the rewards, whilst in the bottom row we added Gaussian white noise to the state. Further results are given in the appendix (further results in stochastic environments).
  • Figure 4: Further results comparing REValueD and DecQN to DecQN using the sum value-decomposition (DecQN-Sum). The solid line corresponds to the mean of 10 seeds, with the shaded area corresponding to a 95% confidence interval.
  • Figure 5: Further results assessing how the performance of DecQN and REValueD is affected by increasing the size of each sub-action space. $n$ corresponds to the size of the sub-action space, i.e. $|\mathcal{A}_i| = n$ for all $i$. The solid line corresponds to the mean of 10 seeds, with the shaded area corresponding to a 95% confidence interval.
  • ...and 8 more figures
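The role of the sub-action space size $n$ in the captions above can be made concrete with a small sketch. It assumes each continuous action dimension is bounded in $[-1, 1]$ and discretised on a uniform grid of $n$ values; these bounds, the uniform spacing and all function names are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np


def sub_action_bins(n, low=-1.0, high=1.0):
    """n evenly spaced values covering [low, high] (assumed discretisation)."""
    return np.linspace(low, high, n)


def to_continuous(sub_actions, n, low=-1.0, high=1.0):
    """Map one sub-action index per dimension to a continuous action vector."""
    bins = sub_action_bins(n, low, high)
    return bins[np.asarray(sub_actions)]


def joint_action_count(num_dims, n):
    """Number of atomic joint actions when every dimension has n sub-actions."""
    return n ** num_dims


def utility_output_count(num_dims, n):
    """Number of utility outputs needed under value-decomposition."""
    return num_dims * n
```

With a hypothetical 5-dimensional action space and $n = 3$, there are $3^5 = 243$ joint actions but only $5 \times 3 = 15$ utility values to estimate. The gap widens quickly as either the number of dimensions or $n$ grows.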

Theorems & Definitions (11)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 2
  • Theorem 2
  • ...and 1 more