Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples

Li Meng, Anis Yazidi, Morten Goodwin, Paal Engelstad

TL;DR

The baseline Q-learning algorithm exhibits unstable and suboptimal behavior in non-deterministic settings, whereas Expert Q-learning demonstrates more robust performance with higher scores, illustrating that the algorithm is indeed suitable to integrate state values from expert examples into Q-learning.

Abstract

In this article, we propose a novel algorithm for deep reinforcement learning named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims to incorporate semi-supervised learning into reinforcement learning by splitting Q-values into state values and action advantages. We require that an offline expert assess the value of a state in a coarse manner using three discrete values. In addition to the Q-network, we design an expert network, which is updated after each regular offline minibatch update whenever the expert example buffer is not empty. Using the board game Othello, we compare our algorithm with the baseline Q-learning algorithm, which is a combination of Double Q-learning and Dueling Q-learning. Our results show that Expert Q-learning is useful and more resistant to overestimation bias. The baseline Q-learning algorithm exhibits unstable and suboptimal behavior in non-deterministic settings, whereas Expert Q-learning demonstrates more robust performance with higher scores, illustrating that our algorithm is suitable for integrating state values from expert examples into Q-learning.
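
The split of Q-values into a state value and action advantages can be illustrated with a minimal sketch. It assumes the mean-subtracted aggregation commonly used in Dueling Q-learning; the exact aggregation used by Expert Q-learning is not given in this summary. The function name `dueling_q_values` and the label set `COARSE_VALUES` are hypothetical, and the three numbers in `COARSE_VALUES` are placeholders for the "three discrete values" mentioned in the abstract, not values taken from the paper.

```python
from statistics import fmean

# Hypothetical coarse expert labels (assumption: the paper only says
# "three discrete values"; these particular numbers are placeholders).
COARSE_VALUES = (-1.0, 0.0, 1.0)


def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Uses the mean-subtracted form Q(s, a) = V(s) + A(s, a) - mean(A(s, .)),
    so that the average of the returned Q-values equals the state value.
    """
    mean_advantage = fmean(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# A state the (hypothetical) expert labels with the highest coarse value,
# with three legal actions and made-up advantages.
q_values = dueling_q_values(COARSE_VALUES[2], [0.5, -0.5, 0.0])
print(q_values)
```

Because the advantages are centered, the state value alone fixes the overall level of the Q-values, which is one way a coarse expert estimate could anchor them.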

Paper Structure

This paper contains 10 sections, 9 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Architecture of our Dueling Q network. The fully connected layer splits into state value and action value branches, and their outputs are combined at the output layer in the Dueling Q network.
  • Figure 2: Architecture of our Expert Q. Convolutional layers have 64 filters with kernel size (3,3) and stride 1. The first convolutional layer uses a padding of 1. A fully connected layer with 512 outputs, a dropout rate of 0.3 and a sigmoid activation function immediately follows. Those layers are combined with batch normalization.
  • Figure 3: Results when playing against different opponents. The top ones are the scores and the bottom ones are initial Q-values, which are the predictions of Q-values of the initial board position by the Q-network. Only the 95% confidence interval of Expert Q scores is plotted for the sake of clarity.
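
The convolution settings in the Figure 2 caption (kernel size 3, stride 1, padding 1 on the first layer) determine how the spatial size of the board changes from layer to layer. The sketch below works through that arithmetic for an 8×8 Othello board. It assumes that layers after the first use no padding, since the caption mentions padding only for the first layer. The number of convolutional layers is not stated in the caption, so the three-layer example is purely illustrative, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(board_size, paddings):
    """Return the spatial size after each convolution, starting from the input."""
    sizes = [board_size]
    for padding in paddings:
        sizes.append(conv_output_size(sizes[-1], padding=padding))
    return sizes


# Illustrative only: first layer with padding 1, then two unpadded layers
# (the actual layer count is an assumption, not taken from the caption).
print(trace_sizes(8, [1, 0, 0]))
```

With padding 1 the first layer preserves the 8×8 board size, while each unpadded 3×3 layer shrinks each side by 2.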