
Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning

Motoki Omura, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada

TL;DR

Symmetric Q-learning adds synthetic noise drawn from a zero-mean distribution to the target values so that the resulting error distribution is closer to Gaussian. By reducing the skewness of the error distribution, the method improved the sample efficiency of a state-of-the-art reinforcement learning method.

Abstract

In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, which violates the implicit assumption of a normal error distribution in the least squares method. To address this, we propose a method called Symmetric Q-learning, in which synthetic noise generated from a zero-mean distribution is added to the target values to produce a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution.
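
The link between least squares and a Gaussian error model can be made explicit. As a sketch, assume a target value $y$, a prediction $Q_\theta(s, a)$, and an error $\epsilon = y - Q_\theta(s, a)$ distributed as $\mathcal{N}(0, \sigma^2)$ with fixed $\sigma$. The symbols $y$, $\theta$, and $\sigma$ are introduced here for illustration and are not taken from the paper. The negative log-likelihood of one sample is then

$$
-\log p(y \mid s, a; \theta) = \frac{\left(y - Q_\theta(s, a)\right)^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right).
$$

Only the first term depends on $\theta$, so maximizing this likelihood is the same as minimizing the squared error. When the actual error distribution is skewed, the squared-error loss no longer corresponds to the maximum-likelihood objective of the true error model.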

Paper Structure (28 sections, 18 equations, 41 figures, 6 tables, 2 algorithms)


Figures (41)

  • Figure 1: Bellman error, negative values of correction noise, and corrected Bellman error from Symmetric REDQ on Hopper-v2. Left: The blue histogram shows the distribution of Bellman errors. The orange histogram represents the distribution of the negative noise added to reduce skewness. It can be observed that the noise distribution fits well with the negative Bellman errors. Right: The green histogram represents the distribution of Bellman errors after adding correction noise. The skewness decreased compared to the blue distribution.
  • Figure 2: The pre-corrected Bellman error at three different steps when learning Walker2d with SymREDQ.
  • Figure 3: Comparison of SymSAC, SAC and $\mathcal{X}$-SAC without ensembles for UTD=1.
  • Figure 4: Comparison of SymREDQ, REDQ and $\mathcal{X}$-REDQ for UTD=20.
  • Figure 5: The top figure illustrates the density of pre-corrected Bellman error (blue) and negative values of noise used for correction (orange). It shows how closely the distribution of $\eta$ approaches the distribution of $- \epsilon$. The bottom figure shows the density of the post-corrected error (green), which is the sum of pre-corrected error and noise. This demonstrates the extent to which the distribution approached a symmetric distribution, and the corrected distribution (green) is more symmetric than the pre-corrected distribution (blue).
  • ...and 36 more figures
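
The captions of Figures 1 and 5 describe correction noise $\eta$ whose distribution approaches that of $-\epsilon$, with the corrected error being the sum of the two. The effect on skewness can be illustrated with a small numerical sketch. It is not the paper's implementation: the Gumbel sample is a hypothetical stand-in for skewed Bellman errors, the noise is drawn directly from the known negated distribution rather than from a fitted one, and the names `skewness` and `demo` are made up for this example.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / centered.std() ** 3)


def demo(n=200_000, seed=0):
    """Return (skewness before correction, skewness after correction)."""
    rng = np.random.default_rng(seed)

    # Assumption: a centered Gumbel sample stands in for the
    # right-skewed, pre-corrected Bellman error epsilon.
    eps = rng.gumbel(size=n)
    eps -= eps.mean()

    # Zero-mean noise eta, drawn independently of eps, whose
    # distribution matches that of -eps (a simplification: here the
    # error distribution is known instead of being estimated).
    raw = rng.gumbel(size=n)
    eta = -(raw - raw.mean())

    # Post-corrected error: pre-corrected error plus noise.
    corrected = eps + eta
    return skewness(eps), skewness(corrected)


if __name__ == "__main__":
    before, after = demo()
    print(f"skewness before correction: {before:.3f}")
    print(f"skewness after correction:  {after:.3f}")
```

Because $\eta$ is independent of $\epsilon$ and distributed like $-\epsilon$, the sum $\epsilon + \eta$ has a symmetric distribution, so its skewness is close to zero up to sampling error. Symmetry alone does not make the sum Gaussian; the sketch only demonstrates the reduction in skewness.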