Minimax Strikes Back

Quentin Cohen-Solal, Tristan Cazenave

TL;DR

The paper evaluates Athénan, a policy-free reinforcement-learning framework built on Descent Minimax and Unbounded Minimax, against Polygames (a reimplementation of AlphaZero) in zero-knowledge complete-information games. Thanks to tree learning, Athénan generates training state data at roughly $296$ times lower cost, and it achieves substantially higher win rates against MCTS baselines, especially when augmented with a reinforcement heuristic, while using far fewer computational resources. In a long-term Hex 13 study against Mohex 2.0, Athénan with the reinforcement heuristic surpasses the strongest public Hex program, while Polygames fails to beat it. Overall, the results show that minimax-based search with tree-based learning can outperform policy-based MCTS approaches under realistic hardware constraints, making it a viable and efficient alternative for zero-knowledge learning in complete-information games.

Abstract

Deep Reinforcement Learning reaches a superhuman level of play in many complete-information games. The state-of-the-art algorithm for learning with zero knowledge is AlphaZero. We take another approach, Athénan, which uses a different, Minimax-based search algorithm called Descent, uses different learning targets, and does not use a policy. We show that for multiple games it is much more efficient than Polygames, the reimplementation of AlphaZero. It is even competitive with Polygames when Polygames uses 100 times more GPU (at least for some games). One of the keys to the superior performance is that the cost of generating state data for training is approximately 296 times lower with Athénan. With the same reasonable resources, Athénan without the reinforcement heuristic is at least 7 times faster than Polygames, and much more than 30 times faster with the reinforcement heuristic.

Paper Structure

This paper contains 18 sections, 6 figures, 4 tables, 3 algorithms.

Figures (6)

  • Figure 1: Performance of Athénan (resp. Polygames) against MCTS with UCT at the end of the $15$ days of training, averaged over the $8$ games. Stratified bootstrap confidence intervals are indicated by the black lines. Athénan results are broken down by whether a reinforcement heuristic (abbreviated R.H.) is used.
  • Figure 2: Average win percentages minus loss percentages of Athénan (resp. Polygames) against MCTS with UCT at the end of the $15$ days of training, for each of the $8$ games. Athénan results are broken down by whether a reinforcement heuristic (abbreviated R.H.) is used. Bootstrap confidence intervals are indicated by the black lines.
  • Figure 3: Evolution of average win rates minus average loss rates of Athénan with reinforcement heuristic (with R.H.), of Athénan without reinforcement heuristic, and of Polygames against MCTS with UCT over the $15$ days of training, with stratified bootstrap confidence intervals over the $8$ games.
  • Figure 4: Evolution of average win rates of Athénan with and without reinforcement heuristic (R.H.) against Mohex 2.0 during $113$ days of training (there is approximately one evaluation every $4$ days; each evaluation consists of $50$ matches as first player and $50$ matches as second player). Shading is the $95\%$ confidence interval.
  • Figure 5: Results of $400$ matches between Athénan ($5$ days of training) and Polygames (using the Polygames tournament networks) on Breakthrough, Othello $8$, and Othello $10$.
  • ...and 1 more figure
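
The captions above report "win rate minus loss rate" scores with stratified bootstrap confidence intervals. The summary does not spell out the exact procedure, so the following is only a minimal sketch of one common choice: a percentile bootstrap in which match outcomes are resampled separately within each game (the strata) and the per-game scores are then averaged. The function names, the outcome coding (+1 win, 0 draw, -1 loss), and the match counts in the usage note are illustrative assumptions, not taken from the paper.

```python
import random


def win_minus_loss(outcomes):
    """Mean of outcomes coded +1 (win), 0 (draw), -1 (loss)."""
    return sum(outcomes) / len(outcomes)


def stratified_bootstrap_ci(outcomes_by_game, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the game-averaged win-minus-loss score.

    Assumption: each game is one stratum; outcomes are resampled with
    replacement within a game, never across games.
    """
    rng = random.Random(seed)
    games = list(outcomes_by_game.values())

    # Point estimate: average of the per-game scores.
    point = sum(win_minus_loss(g) for g in games) / len(games)

    stats = []
    for _ in range(n_resamples):
        per_game = []
        for outcomes in games:
            resample = rng.choices(outcomes, k=len(outcomes))
            per_game.append(win_minus_loss(resample))
        stats.append(sum(per_game) / len(per_game))

    stats.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, stats[lo_idx], stats[hi_idx]
```

As a usage example with made-up data, calling `stratified_bootstrap_ci({"game_a": [1] * 70 + [0] * 10 + [-1] * 20, "game_b": [1] * 40 + [0] * 20 + [-1] * 40})` returns the point estimate together with the lower and upper bounds of a 95% interval. Resampling within each game keeps every game's weight in the average fixed, which is the usual motivation for stratifying.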

Theorems & Definitions (1)

  • Definition 1