Improve Value Estimation of Q Function and Reshape Reward with Monte Carlo Tree Search

Jiamian Li

TL;DR

We propose a novel algorithm that uses Monte Carlo Tree Search to average the value estimates of the Q function. The method generalizes to any algorithm that requires Q value estimation, such as Actor-Critic.

Abstract

Reinforcement learning has achieved remarkable success in perfect information games such as Go and Atari, enabling agents to compete at the highest levels against human players. However, research on reinforcement learning for imperfect information games has been relatively limited because of their more complex game structures and randomness. Traditional methods struggle to train and improve in imperfect information games because of issues such as inaccurate Q value estimation and reward sparsity. In this paper, we focus on Uno, an imperfect information game, and aim to address these problems by reducing Q value overestimation and reshaping the reward function. We propose a novel algorithm that uses Monte Carlo Tree Search to average the value estimates of the Q function. Although we choose Double Deep Q Learning as the foundational framework in this paper, our method can be generalized to any algorithm that requires Q value estimation, such as Actor-Critic. Additionally, we employ Monte Carlo Tree Search to reshape the reward structure of the game environment. We compare our algorithm with several traditional methods applied to games, namely Double Deep Q Learning, Deep Monte Carlo, and Neural Fictitious Self Play. The experiments demonstrate that our algorithm consistently outperforms these approaches, especially as the number of players in Uno, and with it the difficulty of the game, increases.

Paper Structure

This paper contains 12 sections, 11 equations, 8 figures, 4 tables, 2 algorithms.

Figures (8)

  • Figure 1: The different kinds of Uno cards, including the Number card, Skip, Reverse, Draw 2, Wild and Wild Draw 4.
  • Figure 2: The round distribution in the Uno game with 3 players. Round counts were recorded over one hundred thousand games in total. Most games end within roughly 40 to 80 rounds, but some extend beyond 200 rounds.
  • Figure 3: The simplified Markov Decision Process (MDP) consists of four states, $S_0 - S_3$, which represent the possible states. The actions, $a_1 - a_3$, denote the decisions the agent makes to transition from one state to another. The transition function, $p$, represents the probability (greater than or equal to 0) of the agent taking a particular action at a state. The reward function, $r$, which can be either positive or negative, defines the immediate reward the agent receives upon transitioning to a new state.
  • Figure 4: Example hand encodings. In each plane, the first four rows represent the colors yellow, green, blue, and red, while the columns correspond to the number cards from 0 to 9, followed by the action cards: skip, reverse, draw 2, wild, and wild draw 4.
  • Figure 5: The starting player is player 1, so we only expand and backpropagate values of player 1's states.
  • ...and 3 more figures
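
The hand encoding described in the Figure 4 caption can be illustrated with a minimal sketch. It builds a single plane only, with four color rows and fifteen card columns in the order the caption gives. The names `COLORS`, `RANKS` and `encode_hand_plane` are hypothetical, and the meaning of the individual planes, any rows beyond the first four, and the placement of colorless wild cards are not stated in the caption, so the choices below are assumptions rather than the paper's implementation.

```python
# Row and column order follow the Figure 4 caption.
COLORS = ["yellow", "green", "blue", "red"]
RANKS = [str(n) for n in range(10)] + [
    "skip", "reverse", "draw 2", "wild", "wild draw 4",
]


def encode_hand_plane(hand):
    """Return a 4 x 15 binary plane marking which cards are in the hand.

    `hand` is an iterable of (color, rank) pairs. Assumption: wild cards
    are passed with an explicit color row, since the caption does not say
    how colorless cards are placed.
    """
    plane = [[0] * len(RANKS) for _ in COLORS]
    for color, rank in hand:
        row = COLORS.index(color)   # raises ValueError on an unknown color
        col = RANKS.index(rank)     # raises ValueError on an unknown rank
        plane[row][col] = 1
    return plane


if __name__ == "__main__":
    example = encode_hand_plane([("red", "7"), ("yellow", "skip")])
    for color, row in zip(COLORS, example):
        print(f"{color:>6}: {row}")
```

A plane of this kind marks only the presence of a card. Encoding how many copies of a card are held would need additional planes, which is one plausible reading of "in each plane" but is not confirmed by the caption.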