Learning to Play Two-Player Perfect-Information Games without Knowledge

Quentin Cohen-Solal

TL;DR

This work develops Athénan, a zero-knowledge reinforcement-learning framework for two-player perfect-information games that combines non-linear tree learning, a deep-search variant called Descent, completion-based state resolution, and reinforcement heuristics with a novel ordinal action distribution. By unifying these components, Athénan learns game-state evaluations through self-play and achieves state-of-the-art results on multiple domains, notably surpassing Mohex 3HNN in Hex on 11×11 and 13×13 boards, Edax in Othello, and Sharp in Arimaa, all without predefined domain knowledge. The approach also demonstrates strong single-player performance in Morpion Solitaire and competitive Computer Olympiad results, indicating broad applicability to general game-playing tasks. Key contributions include tree learning generalized to non-linear evaluators, a data-generating search (Descent) optimized for learning data quality, a completion framework to leverage resolved states, and reinforcement heuristics that substantially boost learning efficiency. Collectively, the findings suggest that zero-knowledge reinforcement learning within a minimax-inspired framework can rival or exceed knowledge-based or MCTS-based systems across a diverse set of games, with practical implications for General Game Playing and scalable AI.
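To make the idea of tree learning more concrete, the following is a minimal sketch and not the paper's implementation. It assumes the usual reading of the terms: root learning keeps only the searched value of the root state as a training target, while tree learning keeps the searched value of every state in the search tree, so no value computed by the search is discarded. The nested-list tree encoding and the names `minimax_collect`, `root_learning_samples`, and `tree_learning_samples` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical toy encoding: a leaf is a number (terminal gain for the
# first player); an internal node is a list of child subtrees.
Tree = Union[float, List["Tree"]]
Path = Tuple[int, ...]


def minimax_collect(node: Tree, maximizing: bool, path: Path,
                    samples: Dict[Path, float]) -> float:
    """Return the minimax value of `node` and record it under its path."""
    if isinstance(node, (int, float)):
        value = float(node)  # terminal state: value is the game's gain
    else:
        child_values = [
            minimax_collect(child, not maximizing, path + (i,), samples)
            for i, child in enumerate(node)
        ]
        value = max(child_values) if maximizing else min(child_values)
    samples[path] = value
    return value


def root_learning_samples(tree: Tree) -> Dict[Path, float]:
    """Assumed root learning: keep only the root's searched value."""
    samples: Dict[Path, float] = {}
    root_value = minimax_collect(tree, True, (), samples)
    return {(): root_value}


def tree_learning_samples(tree: Tree) -> Dict[Path, float]:
    """Assumed tree learning: keep the searched value of every node."""
    samples: Dict[Path, float] = {}
    minimax_collect(tree, True, (), samples)
    return samples


# Depth-2 toy tree: the first player moves at the root, then the second.
toy_tree: Tree = [[1.0, -1.0], [1.0, 1.0]]
print(len(root_learning_samples(toy_tree)), "sample with root learning")
print(len(tree_learning_samples(toy_tree)), "samples with tree learning")
```

In this sketch a state is identified by the path of move indices leading to it; a real system would pair an encoded game state with its value and feed those pairs to the non-linear evaluator as training data.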

Abstract

In this paper, several techniques for learning game state evaluation functions by reinforcement are proposed. The first is a generalization of tree bootstrapping (tree learning): it is adapted to the context of reinforcement learning without knowledge based on non-linear functions. With this technique, no information is lost during the reinforcement learning process. The second is a modification of minimax with unbounded depth that extends the best sequences of actions to the terminal states. This modified search is intended to be used during the learning process. The third is to replace the classic gain of a game (+1 / -1) with a reinforcement heuristic. We study particular reinforcement heuristics such as quick wins and slow defeats; scoring; and mobility or presence. The fourth is a new action selection distribution. The experiments suggest that these techniques improve the level of play. Finally, we apply these techniques to design program-players for the game of Hex (sizes 11 and 13) that surpass the level of Mohex 3HNN using reinforcement learning from self-play without knowledge.
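As an illustration of replacing the classic gain with a reinforcement heuristic, here is a small sketch of one way a "quick wins and slow defeats" terminal value could be computed. The formula, the function names, and the `max_moves` bound are assumptions made for this example and are not taken from the text above: the magnitude of the gain simply shrinks as the game gets longer.

```python
def classic_gain(first_player_won: bool) -> int:
    """Classic terminal gain: +1 for a win, -1 for a loss."""
    return 1 if first_player_won else -1


def depth_aware_gain(first_player_won: bool, moves_played: int,
                     max_moves: int) -> int:
    """Hypothetical 'quick wins and slow defeats' terminal gain.

    The magnitude is largest for short games, so a fast win scores higher
    than a slow win, and a slow defeat scores higher than a fast defeat.
    `max_moves` is an assumed upper bound on the length of a game.
    """
    if not 0 <= moves_played <= max_moves:
        raise ValueError("moves_played must lie in [0, max_moves]")
    magnitude = max_moves - moves_played + 1
    return magnitude if first_player_won else -magnitude


# Example on an 11x11 board, assuming at most 121 moves per game.
quick_win = depth_aware_gain(True, 10, 121)
slow_win = depth_aware_gain(True, 100, 121)
slow_defeat = depth_aware_gain(False, 100, 121)
quick_defeat = depth_aware_gain(False, 10, 121)
print(quick_win, slow_win, slow_defeat, quick_defeat)
```

Unlike the classic gain, which gives every win the same value, this ordering gives a learner a reason to prefer shorter wins and to delay losses.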

Paper Structure

This paper contains 81 sections, 8 equations, 18 figures, 11 tables, 15 algorithms.

Figures (18)

  • Figure 1: Evolution of the winning percentages of the combinations in the experiment of Section \ref{subsec:Comparison-data-selection}, i.e. MCTS (dotted line) or iterative deepening alpha-beta (continuous line) with tree learning (blue line), root learning (red line), or terminal learning (green line). The display uses a simple moving average over 6 data points.
  • Figure 2: Evolution of the winning percentages of the combinations in the experiment of Section \ref{subsec:Comparison-of-algorithms-for-Learning}, i.e. of Descent (dashed line), $\mathrm{UBFM}$ (dotted dashed line), MCTS (dotted line), and iterative deepening alpha-beta (continuous line) with tree learning (blue line) or root learning (red line). The display uses a simple moving average over 6 data points.
  • Figure 3: The left graph is a game tree where maximizing does not lead to the best decision; the right graph is the same game tree with completion (nodes are labeled by a pair of values), where maximizing leads to the best decision (square node: first-player node (max node); circle node: second-player node (min node); octagon: terminal node).
  • Figure 4: Evolution of the winning percentages of the combinations in the experiment of Section \ref{subsec:Comparison-of-Reinforcement}, i.e. the use of the following heuristics: classic (black line), score (purple line), additive depth (blue line), multiplicative depth (turquoise line), cumulative mobility (green line), and presence (red line). The display uses a simple moving average over 6 data points.
  • Figure 5: Evolution over 15 training days of the performance of the 40 learning repetitions of Athénan and ExIt against base MCTS over all tested games (mean performance and its stratified bootstrapping 5% confidence interval).
  • ...and 13 more figures

Theorems & Definitions (21)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Remark 8
  • Remark 9
  • Remark 10
  • ...and 11 more