Exploiting Estimation Bias in Clipped Double Q-Learning for Continuous Control Reinforcement Learning Tasks

Niccolò Turcato; Alberto Sinigaglia; Alberto Dalla Libera; Ruggero Carli; Gian Antonio Susto

TL;DR

This paper designs a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent, and shows that RL algorithms equipped with this method can match or surpass their counterparts, particularly in environments where estimation biases significantly impact learning.

Abstract

Continuous control Deep Reinforcement Learning (RL) approaches are known to suffer from estimation biases, which lead to suboptimal policies. This paper introduces methods for addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks, using Deep Double Q-Learning. We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent. Most state-of-the-art Deep RL algorithms can be equipped with the BE mechanism without degrading performance or increasing computational complexity. Our extensive experiments across various continuous control tasks demonstrate the effectiveness of our approaches. We show that RL algorithms equipped with this method can match or surpass their counterparts, particularly in environments where estimation biases significantly impact learning. The results underline the importance of bias exploitation in improving policy learning in RL.

Paper Structure (15 sections, 12 equations, 5 figures, 2 tables, 1 algorithm)

Introduction
Related work
Background
Deterministic Policy Gradient
TD3
Estimation bias in Critic updates with Deep Double Q-Learning
The effect of estimation bias on Clipped Double Q-Learning
Estimation bias in complex dynamics
Exploiting the Estimation Bias
Non-stationary bandit problem
Exploration of estimation bias
Benchmarks and Results
Conclusions and future work
Broader Impact
Networks architectures and hyper-parameters

Figures (5)

Figure 1: Top: the custom MDP with continuous action space; with $\mu = +1$, $\sigma = 5$ overestimation is favored, and with $\mu = -1$, $\sigma = 5$ underestimation is favored. Middle: return curves of the tested algorithms. Bottom: the bias choice of the bandit algorithm in BE-CDQ. Lines and shaded areas indicate, respectively, the mean and half a standard deviation of evaluations across 60 random seeds for simulator and network initializations.
Figure 2: Training progress curves for continuous control tasks in OpenAI Gym, showing the effect of the different target computations in TD3. Plots and shaded areas indicate respectively mean and half a standard deviation from evaluation across 10 random seeds for simulator and network initializations.
Figure 3: Training progress curves showing the effect of the soft $\varepsilon$ reset scheduling in BE-TD3. Plots are from 10 random seeds for simulator and network initializations, smoothed for visualization. Evaluations of Return are performed every 5000 time steps, plots show mean and half a standard deviation, over the seeds. ($\varepsilon_d=0.99$, $\alpha=0.25$)
Figure 4: Comparing bias exploitation in Swimmer, Ant, and Hopper. Plots are from 10 random seeds for simulator and network initializations, smoothed for visualization. Evaluations of Return are performed every 5000 time steps, plots show mean and half a standard deviation, over the seeds. In these experiments, $\varepsilon$ is not reset ($\alpha=0.25$, $\varepsilon_d=0.99$).
Figure 5: Comparing Bias Exploiting TD3 with baselines in continuous control tasks. Plots are from 10 random seeds for simulator and network initializations, smoothed for visualization. Evaluations of Return are performed every 5000 time steps, plots show mean and half a standard deviation of evaluation over 10 episodes. For each environment, we report the probability of the bandits choosing overestimation and $\varepsilon$. (In all BE algorithms $\varepsilon_d=0.99$, $e_r=1500$, $\alpha=0.25$)
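
The captions above refer to a bandit that chooses between overestimation and underestimation, with an exploration rate $\varepsilon$, a decay factor $\varepsilon_d$, and a step size $\alpha$. This page does not give the update rules, so the following is only a minimal sketch under stated assumptions: a two-armed $\varepsilon$-greedy bandit with a constant step size (a common choice for non-stationary bandits), where the "under" arm uses the Clipped Double Q-Learning target (minimum of the two critics) and the "over" arm uses the maximum as a stand-in overestimating target. The class `BiasBandit`, the function `td_target`, the arm names, the use of `max`, and the reward fed to the bandit are all hypothetical and may differ from the paper's actual mechanism; the $\varepsilon$ reset period $e_r$ is not modeled.

```python
import random


class BiasBandit:
    """Illustrative two-armed epsilon-greedy bandit (not the paper's exact rule)."""

    ARMS = ("under", "over")

    def __init__(self, alpha=0.25, eps=1.0, eps_decay=0.99, seed=0):
        self.alpha = alpha          # constant step size, tracks non-stationary rewards
        self.eps = eps              # current exploration probability
        self.eps_decay = eps_decay  # multiplicative decay applied after each update
        self.values = {arm: 0.0 for arm in self.ARMS}
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability eps, otherwise pick the arm with the highest value.
        if self.rng.random() < self.eps:
            return self.rng.choice(self.ARMS)
        return max(self.ARMS, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Exponential recency-weighted average of the rewards seen for this arm.
        self.values[arm] += self.alpha * (reward - self.values[arm])
        self.eps *= self.eps_decay


def td_target(reward, gamma, q1_next, q2_next, arm, done=False):
    """One-step TD target; 'under' clips with min, 'over' uses max (assumed)."""
    q_next = min(q1_next, q2_next) if arm == "under" else max(q1_next, q2_next)
    return reward + (0.0 if done else gamma * q_next)
```

In a training loop, one would call `select()` before a block of critic updates, build targets with `td_target(..., arm=...)`, and then call `update()` with some measure of the resulting improvement in return; what that measure should be is left open here.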

Theorems & Definitions (1)

Remark 2.1
