Search-contempt: a hybrid MCTS algorithm for training AlphaZero-like engines with better computational efficiency

Ameya Joshi

TL;DR

This work introduces search-contempt, a hybrid asymmetric MCTS that blends PUCT with Thompson Sampling via a new control parameter $N_{scl}$. By freezing the visit distribution after $N_{scl}$ visits on the opponent's turns and switching to sampling, the method biases self-play toward more challenging, information-rich positions, improving training data quality and learning efficiency. Empirical results show substantial strength gains in Odds Chess and a significant reduction in compute requirements for AlphaZero-like training, with values of $N_{scl}$ around 5 providing an effective balance between exploration and strength. The approach enables training from zero on consumer hardware by delivering higher-quality self-play data with far fewer games. It also offers a mechanism to generate diverse, puzzle-like positions that challenge neural networks and potentially improve robustness to adversarial strategies.
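The switching rule summarized above can be illustrated with a minimal Python sketch. It is not the author's implementation: the `Node` and `Edge` classes, the `c_puct` constant, the convention that odd depth marks the opponent's turn, and the choice to sample in proportion to the frozen visit counts are all assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Edge:
    prior: float            # policy prior for this action
    visits: int = 0         # live visit count N(s, a)
    value_sum: float = 0.0  # sum of backed-up values for this action
    frozen_visits: int = 0  # visit count captured when the node froze


@dataclass
class Node:
    depth: int                                 # 0 at the root; odd = opponent to move (assumed)
    edges: dict = field(default_factory=dict)  # action -> Edge
    frozen: bool = False                       # True once the node has switched to sampling


def puct_select(node, c_puct=1.5):
    """Standard PUCT choice: mean value plus a prior-weighted exploration bonus."""
    total = sum(e.visits for e in node.edges.values())

    def score(e):
        q = e.value_sum / e.visits if e.visits else 0.0
        return q + c_puct * e.prior * math.sqrt(total + 1) / (1 + e.visits)

    return max(node.edges, key=lambda a: score(node.edges[a]))


def select_action(node, n_scl, rng=random, c_puct=1.5):
    """Use PUCT until an opponent node has n_scl visits, then sample.

    Assumes n_scl >= 1 so the frozen counts are not all zero.
    """
    total = sum(e.visits for e in node.edges.values())
    if node.depth % 2 == 1 and not node.frozen and total >= n_scl:
        # Freeze the visit distribution reached so far at this opponent node.
        for e in node.edges.values():
            e.frozen_visits = e.visits
        node.frozen = True
    if node.frozen:
        # Sample an action in proportion to the frozen visit counts.
        actions = list(node.edges)
        weights = [node.edges[a].frozen_visits for a in actions]
        return rng.choices(actions, weights=weights, k=1)[0]
    return puct_select(node, c_puct)
```

In this sketch, nodes at even depth never freeze, so the player to move at the root keeps searching with PUCT however many visits accumulate, while the modeled opponent stops refining its move choice once $N_{scl}$ visits have been spent.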

Abstract

In 2017, AlphaZero mastered chess and other games without human knowledge by playing millions of games against itself (self-play), with a computation budget running in the tens of millions of dollars. It used a variant of the Monte Carlo Tree Search (MCTS) algorithm known as PUCT. This paper introduces search-contempt, a novel hybrid variant of the MCTS algorithm that fundamentally alters the distribution of positions generated in self-play, preferring more challenging positions. In addition, search-contempt has been shown to give a big boost in strength for engines in Odds Chess (where one side receives an unfavorable position from the start). More significantly, it opens up the possibility of training a self-play based engine in a much more computationally efficient manner, with the number of training games running into hundreds of thousands and costing tens of thousands of dollars (instead of the tens of millions of training games costing millions of dollars required by AlphaZero). This means that it may finally be possible to train such a program from zero on a standard consumer GPU, even with a very limited compute, cost, or time budget.

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: A sample snapshot of the state of the tree search using search-contempt. State $s_0$ is the root node. State $s_{1}$ has transitioned to Thompson Sampling (TS) since its $d_s$ value is odd and its visit count $N(s_{0},a_{0})$ exceeded $N_{scl}$ at some point during the search. For $s_{1}$ and all other states that have transitioned their search to TS, the edges of the node separately store $N_{scl}(s,a)$ and $P_{mcts}(s,a)$ for each of its actions. These values were frozen at the point where $N(s_0,a_0) = N_{scl}$, so $N_{scl}(s_1,a_0) + N_{scl}(s_1,a_1) + N_{scl}(s_1,a_2) = N_{scl}$ always holds. States $s_2$ and $s_3$, on the other hand, still search with PUCT since their visit counts have not exceeded $N_{scl}$ yet. State $s_{0}$'s visit count has exceeded $N_{scl}$, but it still uses PUCT since its $d_s$ is even, i.e., it is a node corresponding to the player whose turn it currently is. Similar reasoning applies to states $s_4$ through $s_9$, which therefore also use PUCT.
  • Figure 2: Plot of $\frac{w+l}{d}$ vs $N_{scl}$ for the search-contempt algorithm in self-play mode. Here, $w$, $l$ and $d$ are the numbers of wins, losses and draws, respectively, from the white player's perspective. A total of 1000 nodes is used for each move, and 100 games are played for each value of $N_{scl}$. There is a value of $N_{scl}$ that is optimal for generating training games; it occurs roughly at $N_{scl} = 5$, where $\frac{w+l}{d} \approx 1$.
  • Figure 3: Plot of $\frac{w+l}{d}$ vs $\tau$ for self-play with the PUCT-based MCTS algorithm using 1000 visits for each turn. 100 games are played for each value of $\tau$. The exact values of $\tau$ used are omitted since $\tau$ is not fixed for all the moves in a game, so the axis is just a one-dimensional projection of the $\tau$ schedule used in each case. All that can be said is that $\tau_{2} > \tau_{1}$ for any particular move number. Here, $\tau_2$ gives a $\frac{w+l}{d}$ ratio of 1.38 compared to just 0.05 for $\tau_1$. $\tau_2$ is thus a better candidate for generating training games than $\tau_1$.
  • Figure 4: Plot of the fraction of repeated games out of a total of 100 games vs move number for each of the self-play experiments presented earlier. A value of 1 means all 100 games are identical up to that particular move number, and a value of 0 means none of the 100 self-play games are repeated. For the high-temperature case, i.e., $\tau_2$, the repeat rate drops very quickly to 0 by move 4. For $\tau_1$, on the other hand, the repeat rate drops more slowly and all the games are unique by move 30, which is expected since $\tau_2 > \tau_1$. For search-contempt, which uses $\tau_1$ for self-play, the repeat rate is not very different from the PUCT-based MCTS case with $\tau_1$, even for different values of $N_{scl}$, demonstrating that lowering $N_{scl}$ does not negatively impact the repeat rate. Note that in all cases the games are almost all unique by move 20.
  • Figure 5: Three consecutive chess positions from a self-play game (game 1 of $N_{scl} = 5$) played using search-contempt, showing how it deviates from PUCT-based MCTS. Position 1 is roughly balanced, with an objective evaluation Q of 0.0 and a best move of Rd1. (Here Q varies from -1 to 1, with Q=1 referring to white winning, Q=-1 referring to black winning and Q=0 referring to a draw or an equal position for both players.) As expected, this is the move that PUCT-based MCTS prefers. However, search-contempt goes for d6 since, at the low node count of 5, the evaluation of Position 2 is slightly favorable for white. This is because the raw neural network (NN) evaluates Position 2 as 0.004, which is a severe misevaluation since it is an objectively losing position with a Q of -0.82. It is only at Position 3 that the NN evaluates it around -0.34, which is closer to the objective value.
  • ...and 1 more figures
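
The two statistics reported in the captions of Figures 2 through 4 can be computed with a short Python sketch. The function names and input formats are hypothetical, and the repeat fraction is computed under one assumed reading of the Figure 4 caption: a game counts as repeated at a given move number if at least one other game shares its first moves up to that point.

```python
from collections import Counter


def decisive_ratio(results):
    """Return (w + l) / d for results given as 'w', 'l' or 'd' (White's view)."""
    counts = Counter(results)
    if counts["d"] == 0:
        return float("inf")  # no draws: the ratio is unbounded
    return (counts["w"] + counts["l"]) / counts["d"]


def repeat_fraction(games, move_number):
    """Fraction of games whose first `move_number` moves match another game's.

    `games` is a non-empty list of move lists. Returns 1.0 when every game
    shares its opening prefix with at least one other game, and 0.0 when
    all prefixes are unique.
    """
    prefixes = Counter(tuple(game[:move_number]) for game in games)
    repeated = sum(count for count in prefixes.values() if count > 1)
    return repeated / len(games)
```

For example, `decisive_ratio(["w", "l", "d", "d"])` gives 1.0, the balance between decisive and drawn games that the Figure 2 caption associates with $N_{scl} = 5$.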