Exploiting Estimation Bias in Clipped Double Q-Learning for Continuous Control Reinforcement Learning Tasks

Niccolò Turcato, Alberto Sinigaglia, Alberto Dalla Libera, Ruggero Carli, Gian Antonio Susto

TL;DR

This paper designs a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent, and shows that RL algorithms equipped with this method can match or surpass their counterparts, particularly in environments where estimation biases significantly impact learning.

Abstract

Continuous control Deep Reinforcement Learning (RL) approaches are known to suffer from estimation biases, leading to suboptimal policies. This paper introduces methods for addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks, using Deep Double Q-Learning. We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent. Most state-of-the-art Deep RL algorithms can be equipped with the BE mechanism without hindering performance or increasing computational complexity. Our extensive experiments across various continuous control tasks demonstrate the effectiveness of our approaches. We show that RL algorithms equipped with this method can match or surpass their counterparts, particularly in environments where estimation biases significantly impact learning. The results underline the importance of bias exploitation in improving policy learning in RL.
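To make the idea of dynamically selecting an estimation bias concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-armed epsilon-greedy bandit that picks, per episode, between an underestimating target (the minimum of two critic estimates, as in Clipped Double Q-Learning) and an overestimating one. The maximum of the two critics is used here purely as a stand-in for an overestimating target. Reading $\alpha$ as the bandit step size and $\varepsilon_d$ as a multiplicative decay for $\varepsilon$ is also an assumption, as are all class and function names.

```python
import random


class BiasBandit:
    """Hypothetical two-armed epsilon-greedy bandit.

    Arm 0 selects the underestimating target, arm 1 the overestimating one.
    """

    def __init__(self, epsilon=1.0, epsilon_decay=0.99, alpha=0.25, seed=0):
        self.epsilon = epsilon              # exploration probability
        self.epsilon_decay = epsilon_decay  # assumed meaning of epsilon_d
        self.alpha = alpha                  # assumed meaning of alpha (step size)
        self.values = [0.0, 0.0]            # running value estimate per arm
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise pick the best arm so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(2)
        return 0 if self.values[0] >= self.values[1] else 1

    def update(self, arm, episode_return):
        # Exponential moving average of the return obtained with this arm.
        self.values[arm] += self.alpha * (episode_return - self.values[arm])
        self.epsilon *= self.epsilon_decay


def td_target(reward, discount, q1_next, q2_next, arm, done=False):
    """TD target using the bias chosen by the bandit (illustrative only)."""
    if arm == 0:
        q_next = min(q1_next, q2_next)  # clipped double Q: tends to underestimate
    else:
        q_next = max(q1_next, q2_next)  # stand-in for an overestimating target
    return reward + (0.0 if done else discount * q_next)
```

In a training loop, one would call `select()` at the start of an episode, compute critic targets with `td_target(...)` using the chosen arm, and call `update(arm, episode_return)` once the episode ends.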

Paper Structure

This paper contains 15 sections, 12 equations, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Top: the custom MDP with continuous action space; with $\mu = +1$, $\sigma = 5$ overestimation is favored, and with $\mu = -1$, $\sigma = 5$ underestimation is favored. Middle: return curves of the tested algorithms. Bottom: the bias choice of the bandit algorithm in BE-CDQ. Curves and shaded areas indicate, respectively, the mean and half a standard deviation of evaluations across 60 random seeds for simulator and network initializations.
  • Figure 2: Training progress curves for continuous control tasks in OpenAI Gym, showing the effect of the different target computations in TD3. Curves and shaded areas indicate, respectively, the mean and half a standard deviation of evaluations across 10 random seeds for simulator and network initializations.
  • Figure 3: Training progress curves showing the effect of the soft $\varepsilon$ reset scheduling in BE-TD3. Plots are from 10 random seeds for simulator and network initializations, smoothed for visualization. Return is evaluated every 5000 time steps; plots show the mean and half a standard deviation over the seeds. ($\varepsilon_d=0.99$, $\alpha=0.25$)
  • Figure 4: Comparison of bias exploitation in Swimmer, Ant, and Hopper. Plots are from 10 random seeds for simulator and network initializations, smoothed for visualization. Return is evaluated every 5000 time steps; plots show the mean and half a standard deviation over the seeds. In these experiments, $\varepsilon$ is not reset ($\alpha=0.25$, $\varepsilon_d=0.99$).
  • Figure 5: Comparison of Bias Exploiting TD3 with baselines in continuous control tasks. Plots are from 10 random seeds for simulator and network initializations, smoothed for visualization. Return is evaluated every 5000 time steps; plots show the mean and half a standard deviation of evaluation over 10 episodes. For each environment, we also report $\varepsilon$ and the probability of the bandits choosing overestimation. (In all BE algorithms, $\varepsilon_d=0.99$, $e_r=1500$, $\alpha=0.25$)
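
The captions above describe curves as a mean with a shaded band of half a standard deviation across seeds. As a rough sketch of how such a band could be computed from per-seed evaluation returns, the function below uses the sample standard deviation. Whether the figures use the sample or the population estimator is not stated, so that choice, like the function name, is an assumption.

```python
from statistics import mean, stdev


def shaded_band(returns_per_seed):
    """Return (lower, mean, upper) per evaluation point.

    returns_per_seed: list of per-seed lists, each holding one evaluation
    return per evaluation point (all lists the same length, at least 2 seeds).
    """
    band = []
    # zip(*...) groups the returns of all seeds at the same evaluation point.
    for values in zip(*returns_per_seed):
        m = mean(values)
        half_sd = 0.5 * stdev(values)  # sample standard deviation (assumption)
        band.append((m - half_sd, m, m + half_sd))
    return band
```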

Theorems & Definitions (1)

  • Remark 2.1