Suppressing Overestimation in Q-Learning through Adversarial Behaviors

HyeAnn Lee; Donghwan Lee

TL;DR

Dummy adversarial Q-learning (DAQ) is a simple but effective way to suppress overestimation bias through dummy adversarial behaviors, and it can be easily applied to off-the-shelf value-based reinforcement learning algorithms to improve their performance.

Abstract

This paper proposes a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, learning can be formulated as a two-player zero-sum game. DAQ unifies several Q-learning variations that control overestimation bias, such as maxmin Q-learning and minmax Q-learning (proposed in this paper), in a single framework. DAQ is a simple but effective way to suppress overestimation bias through dummy adversarial behaviors, and it can be easily applied to off-the-shelf reinforcement learning algorithms to improve their performance. A finite-time convergence analysis of DAQ is given from an integrated perspective by adapting adversarial Q-learning. The performance of DAQ is demonstrated empirically in various benchmark environments.
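The overestimation bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not DAQ itself, whose update rule is not given on this page. It only shows why taking a max over noisy action-value estimates is biased upward, and how a maxmin-style aggregation (min over several estimates, then max over actions, as maxmin Q-learning is commonly described) lowers the resulting value. The function names, the Gaussian noise model, and all parameter values are assumptions chosen for illustration.

```python
import random
import statistics


def standard_target(q_row):
    """Bootstrap value used by standard Q-learning: max over actions."""
    return max(q_row)


def maxmin_target(q_rows):
    """Maxmin-style value: min over estimates per action, then max over actions."""
    num_actions = len(q_rows[0])
    return max(min(q[a] for q in q_rows) for a in range(num_actions))


def simulate(num_actions=5, num_estimates=2, noise_std=1.0, trials=20000, seed=0):
    """Average both values when every true action value is 0 (illustrative setup)."""
    rng = random.Random(seed)
    std_vals, maxmin_vals = [], []
    for _ in range(trials):
        # Each estimate is the true value (0) plus independent Gaussian noise.
        q_rows = [
            [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
            for _ in range(num_estimates)
        ]
        std_vals.append(standard_target(q_rows[0]))
        maxmin_vals.append(maxmin_target(q_rows))
    return statistics.mean(std_vals), statistics.mean(maxmin_vals)


# The true max is 0, so each average is an empirical estimate of the bias.
std_bias, maxmin_bias = simulate()
print(f"standard max bias: {std_bias:.3f}")
print(f"maxmin bias:       {maxmin_bias:.3f}")
```

Because every true action value is zero in this toy setup, any positive average is pure estimation bias. Increasing `num_estimates` pushes the maxmin value further down, which is the knob such methods use to trade overestimation against underestimation.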

Paper Structure

This paper contains 23 sections, 1 theorem, 15 equations, and 16 figures.

Introduction
Related Works
Preliminaries
Markov Decision Process
Two-Player Zero-Sum Markov Game
Controlling Overestimation Biases
Maxmin Q-Learning
Double Q-Learning
Twin-Delayed Deep Deterministic Policy Gradient (TD3)
Proposed Algorithms
Minmax Q-Learning
Dummy Adversarial Q-Learning (DAQ)
Discussion on Asynchronous versus Synchronous Updates
Interpretation from the Two-Player Zero-Sum Game
Finite-Time Convergence Analysis
...and 8 more sections

Key Result

Theorem 1

Let us consider the asynchronous version of DAQ. For any $t \geq 0$, we have where $Q_i$ is the $i$-th estimate at iteration step $t$, and $Q_i^*$ is the optimal Q-function corresponding to the $i$-th estimate.

Figures (16)

Figure 1: MDP Environments - (a) Grid World
Figure 2: MDP Environments - (b) Sutton's MDP
Figure 3: MDP Environments - (c) Weng's MDP
Figure 4: Average rewards in the grid world environment. DAQs achieve the optimal policy for both reward functions. Moving averages with a window size of 100 are shown as vivid lines.
Figure 5: Experiments with Sutton's MDP with $\mu=-0.1$. DAQs substantially outperform the other algorithms. The number of episodes differs across subfigures in order to show the convergence of each algorithm. For DAQs, $b_1=-1$ and $b_2=-2$ were used.
...and 11 more figures

Theorems & Definitions (1)

Theorem 1
