Suppressing Overestimation in Q-Learning through Adversarial Behaviors

HyeAnn Lee, Donghwan Lee

TL;DR

The proposed dummy adversarial Q-learning (DAQ) is a simple but effective way to suppress overestimation bias through dummy adversarial behaviors, and it can be easily applied to off-the-shelf value-based reinforcement learning algorithms to improve their performance.

Abstract

This paper proposes a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, learning can be formulated as a two-player zero-sum game. DAQ unifies several Q-learning variations that control overestimation bias, such as maxmin Q-learning and minmax Q-learning (proposed in this paper), in a single framework. DAQ is a simple but effective way to suppress overestimation bias through dummy adversarial behaviors and can be easily applied to off-the-shelf reinforcement learning algorithms to improve their performance. The finite-time convergence of DAQ is analyzed from an integrated perspective by adapting adversarial Q-learning. The performance of DAQ is empirically demonstrated in various benchmark environments.
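To make the overestimation issue concrete, the following is a minimal sketch, not the authors' DAQ implementation. It compares the standard Q-learning target with a maxmin-style target (max over actions of the min over estimators) and a minmax-style target (min over estimators of the max over actions). The minmax form is an assumption inferred from the name, and the dummy adversary itself is not modeled here. All function names, the noise model, and the parameters (`n_trials`, `n_actions`, `n_estimators`, `seed`) are hypothetical and chosen only for illustration.

```python
import random


def standard_target(q_next, reward, gamma):
    """Standard Q-learning target: r + gamma * max_a Q(s', a)."""
    return reward + gamma * max(q_next)


def maxmin_target(q_next_list, reward, gamma):
    """Maxmin target: r + gamma * max_a min_i Q_i(s', a)."""
    n_actions = len(q_next_list[0])
    pessimistic = [min(q[a] for q in q_next_list) for a in range(n_actions)]
    return reward + gamma * max(pessimistic)


def minmax_target(q_next_list, reward, gamma):
    """Assumed minmax target: r + gamma * min_i max_a Q_i(s', a)."""
    return reward + gamma * min(max(q) for q in q_next_list)


def average_targets(n_trials=2000, n_actions=5, n_estimators=2, seed=0):
    """Average each target when every true next-state value is 0.

    Estimates are the true value plus independent unit Gaussian noise
    (an illustrative assumption), so any nonzero average target is
    pure estimation bias.
    """
    rng = random.Random(seed)
    totals = {"standard": 0.0, "maxmin": 0.0, "minmax": 0.0}
    for _ in range(n_trials):
        estimates = [[rng.gauss(0.0, 1.0) for _ in range(n_actions)]
                     for _ in range(n_estimators)]
        totals["standard"] += standard_target(estimates[0], 0.0, 1.0)
        totals["maxmin"] += maxmin_target(estimates, 0.0, 1.0)
        totals["minmax"] += minmax_target(estimates, 0.0, 1.0)
    return {name: total / n_trials for name, total in totals.items()}


if __name__ == "__main__":
    for name, value in average_targets().items():
        print(f"{name:>8}: {value:+.3f}")
```

In this toy setup the true target is 0, so the positive average of the standard target is the overestimation bias. For every sample, the maxmin target is at most the minmax target (the max-min inequality), which in turn is at most the standard target computed from the first estimator, so both multi-estimator targets reduce the bias here.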

Paper Structure

This paper contains 23 sections, 1 theorem, 15 equations, and 16 figures.

Key Result

Theorem 1

Let us consider the asynchronous version of DAQ. For any $t \geq 0$, a finite-time bound on the error of each estimate holds, where $Q_i$ denotes the $i$-th estimate at iteration step $t$ and $Q_i^*$ is the optimal Q-function corresponding to the $i$-th estimate.

Figures (16)

  • Figure 1: MDP Environments - (a) Grid World
  • Figure 2: MDP Environments - (b) Sutton's MDP
  • Figure 3: MDP Environments - (c) Weng's MDP
  • Figure 4: Average rewards in the grid world environment. DAQs achieve the optimal policy for both reward functions. Moving averages with a window size of 100 are shown as the vivid lines.
  • Figure 5: Experiments with Sutton's MDP with $\mu=-0.1$. DAQs substantially outperform the other algorithms. The number of episodes differs across subfigures in order to show the convergence of each algorithm. For DAQs, $b_1=-1$ and $b_2=-2$ were used.
  • ...and 11 more figures

Theorems & Definitions (1)

  • Theorem 1