
Resource-Efficient Model-Free Reinforcement Learning for Board Games

Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada

TL;DR

This study proposes a model-free reinforcement learning algorithm for board games that learns more efficiently than existing methods, and argues that this efficiency shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods.

Abstract

Board games have long served as complex decision-making benchmarks in artificial intelligence. In this field, search-based reinforcement learning methods such as AlphaZero have achieved remarkable success. However, their substantial computational demands have been identified as a barrier to reproducibility. In this study, we propose a model-free reinforcement learning algorithm designed for board games that achieves more efficient learning. To validate the efficiency of the proposed method, we conducted comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The results demonstrate that the proposed method learns more efficiently than existing methods across these environments. In addition, our extensive ablation study shows the importance of the core techniques used in the proposed method. We believe that our efficient algorithm shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods.

Paper Structure

This paper contains 46 sections, 1 theorem, 18 equations, 14 figures, 10 tables, and 1 algorithm.

Key Result

Theorem A.2

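Let $\mathcal{A}$ be a finite set, $\pi(a)$ a probability mass function over $\mathcal{A}$, $Q: \mathcal{A} \to \mathbb{R}$ a function, and $\Delta$ the set of all probability mass functions over $\mathcal{A}$. Consider the following optimization problem:

$$\pi' = \arg\max_{p \in \Delta} \; \sum_{a \in \mathcal{A}} p(a)\,Q(a) \;-\; \beta\,\mathrm{KL}(p \,\|\, \pi) \;+\; \alpha\,\mathcal{H}(p),$$

where $\beta > 0$ and $\alpha > 0$. Then, the optimal solution is given by:

$$\pi'(a) = \frac{1}{Z}\,\pi(a)^{\frac{\beta}{\alpha+\beta}} \exp\!\left(\frac{Q(a)}{\alpha+\beta}\right),$$

where $Z = \sum_{b \in \mathcal{A}} \pi(b)^{\frac{\beta}{\alpha+\beta}} \exp\!\left(\frac{Q(b)}{\alpha+\beta}\right)$ is the normalization constant.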
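The closed form of Theorem A.2 can be checked numerically. The sketch below is a minimal NumPy illustration, not the authors' implementation. It assumes that $\beta$ weights the KL term and $\alpha$ the entropy term, i.e. that the objective is $\sum_a p(a)Q(a) - \beta\,\mathrm{KL}(p\|\pi) + \alpha\,\mathcal{H}(p)$. The function names `kl_entropy_update` and `regularized_objective` are hypothetical. Actions with $\pi(a) = 0$ keep zero probability, which matches the closed form since $0^{\beta/(\alpha+\beta)} = 0$.

```python
import numpy as np


def kl_entropy_update(pi, q, alpha, beta):
    """Closed-form maximizer over the simplex of
        sum_a p(a) Q(a) - beta * KL(p || pi) + alpha * H(p).
    Assumption (hypothetical convention): beta weights KL, alpha weights entropy.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    pi = np.asarray(pi, dtype=float)
    q = np.asarray(q, dtype=float)
    tau = alpha + beta
    # log p(a) = (beta / tau) * log pi(a) + Q(a) / tau - log Z
    with np.errstate(divide="ignore"):  # log(0) -> -inf keeps zero-probability actions at 0
        logits = (beta / tau) * np.log(pi) + q / tau
    logits = logits - logits.max()  # shift for numerical stability; Z absorbs the constant
    weights = np.exp(logits)
    return weights / weights.sum()  # divide by the normalization constant Z


def regularized_objective(p, pi, q, alpha, beta):
    """Value of the regularized objective for a candidate distribution p."""
    p = np.asarray(p, dtype=float)
    pi = np.asarray(pi, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0  # convention: 0 * log 0 = 0
    kl = np.sum(p[support] * (np.log(p[support]) - np.log(pi[support])))
    entropy = -np.sum(p[support] * np.log(p[support]))
    return float(p @ q - beta * kl + alpha * entropy)


if __name__ == "__main__":
    pi = np.array([0.5, 0.3, 0.2])   # current policy over three actions
    q = np.array([1.0, 0.0, -1.0])   # action values
    new_pi = kl_entropy_update(pi, q, alpha=0.1, beta=1.0)
    print(new_pi.round(3))           # mass shifts toward the highest-Q action
```

Because the objective is strictly concave for $\alpha > 0$, no other point of the simplex should score higher than the returned distribution, which makes random sampling a simple sanity check. Letting $\alpha \to 0$ recovers a purely KL-regularized update $\pi'(a) \propto \pi(a)\exp(Q(a)/\beta)$.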

Figures (14)

  • Figure 1: Average performance across the five board games. The proposed method (KLENT) achieves efficient learning compared to existing approaches.
  • Figure 2: Conceptual comparison between AlphaZero and the proposed method KLENT. Search-based methods such as AlphaZero model the policy and the value function $V(s)$, and use tree search to estimate the Q-function $Q(s,a)$. KLENT, by contrast, directly models both the policy and the Q-function with neural networks, eliminating the need for search.
  • Figure 3: Illustration of the difficulty of learning action values in 9x9 Go. Learning the action-value function is generally more difficult than learning the state-value function, as it often requires handling more complex spatial features.
  • Figure 4: Bias-variance trade-off in 9x9 Go. An intermediate $\lambda$ minimizes the sum of squared bias and variance.
  • Figure 5: Performance comparison between the proposed method KLENT and existing methods. KLENT achieves competitive or higher efficiency compared to existing methods.
  • ...and 9 more figures
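Figure 4 refers to a parameter $\lambda$. Assuming it is the trace parameter of $\lambda$-returns as in TD($\lambda$), the targets can be computed with the standard backward recursion $G^\lambda_t = r_t + \gamma\,[(1-\lambda)\,V(s_{t+1}) + \lambda\,G^\lambda_{t+1}]$. Setting $\lambda = 0$ gives the one-step bootstrapped target (lower variance, but biased by the value estimate), and $\lambda = 1$ gives the Monte Carlo return (no bootstrap bias, higher variance), so an intermediate value trades the two off. The sketch below is a generic single-agent version; the function name `lambda_returns` and its arguments are hypothetical, and a two-player implementation may additionally need to negate values between plies, which is not shown.

```python
import numpy as np


def lambda_returns(rewards, next_values, dones, gamma=1.0, lam=0.9):
    """Backward recursion for lambda-returns over one trajectory segment.

    rewards[t]     : reward received after step t
    next_values[t] : bootstrap estimate V(s_{t+1}); ignored when dones[t] is True
    dones[t]       : True if s_{t+1} is terminal
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    horizon = len(rewards)
    returns = np.zeros(horizon)
    for t in reversed(range(horizon)):
        if dones[t]:
            # terminal transition: nothing to bootstrap from
            returns[t] = rewards[t]
            continue
        # past the end of a truncated segment, use V(s_{t+1}) as the tail
        tail = returns[t + 1] if t + 1 < horizon else next_values[t]
        returns[t] = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * tail)
    return returns


if __name__ == "__main__":
    # a three-move episode that ends in a win (+1) for the agent
    rewards = [0.0, 0.0, 1.0]
    next_values = [0.2, 0.5, 0.0]
    dones = [False, False, True]
    for lam in (0.0, 0.5, 1.0):
        print(lam, lambda_returns(rewards, next_values, dones, lam=lam))
```

In the usage example, the target for the first move moves from the bootstrapped estimate (0.2 at $\lambda = 0$) through 0.475 at $\lambda = 0.5$ to the final outcome (1.0 at $\lambda = 1$).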

Theorems & Definitions (3)

  • Definition A.1: KL divergence and entropy
  • Theorem A.2: Formal Derivation of the Analytical Solution $\pi'$
  • Proof
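For reference, a standard Lagrangian argument recovers the closed form of Theorem A.2. This is a sketch, not the paper's proof. It assumes that $\beta$ multiplies the KL term and $\alpha$ the entropy term in the objective $\sum_a p(a)Q(a) - \beta\,\mathrm{KL}(p\|\pi) + \alpha\,\mathcal{H}(p)$, and that $\pi(a) > 0$ for all $a$; the multiplier $\mu$ is introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}(p,\mu) &= \sum_{a} p(a)\,Q(a)
  - \beta \sum_{a} p(a)\log\frac{p(a)}{\pi(a)}
  - \alpha \sum_{a} p(a)\log p(a)
  + \mu\Big(1 - \sum_{a} p(a)\Big),\\
\frac{\partial \mathcal{L}}{\partial p(a)} &= Q(a) - \beta\big(\log p(a) - \log\pi(a) + 1\big) - \alpha\big(\log p(a) + 1\big) - \mu = 0,\\
\Rightarrow\quad (\alpha+\beta)\log p(a) &= Q(a) + \beta\log\pi(a) - (\alpha + \beta + \mu),\\
\Rightarrow\quad p(a) &= \frac{1}{Z}\,\pi(a)^{\frac{\beta}{\alpha+\beta}}\exp\!\left(\frac{Q(a)}{\alpha+\beta}\right),
\qquad Z = \sum_{b} \pi(b)^{\frac{\beta}{\alpha+\beta}}\exp\!\left(\frac{Q(b)}{\alpha+\beta}\right).
\end{aligned}
```

Since $-\beta\,\mathrm{KL}(p\|\pi) + \alpha\,\mathcal{H}(p)$ is strictly concave in $p$ and the remaining term is linear, this stationary point is the unique maximizer; the nonnegativity constraints are inactive because the solution is strictly positive whenever $\pi$ is.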