Playing Board Games with the Predict Results of Beam Search Algorithm
Sergey Pastukhov
TL;DR
PROBS tackles deterministic two-player games with perfect information by training neural networks to predict the outcomes of a beam search. It uses two networks: a value network $V_\theta(s)$ that estimates the final game outcome from state $s$, and an action-value network $Q_\phi(s,a)$ that predicts the result of a beam search over the subtree reached by playing $a$ in $s$. Both networks are trained iteratively through self-play, in which moves are chosen by a softmax over action values with Dirichlet noise added for exploration. During beam search, $Q_\phi$ biases which states are expanded, and $V_\theta$ evaluates the leaves instead of playing the game to the end. Unlike MCTS-centric approaches such as AlphaZero, PROBS relies on a fixed-depth beam search, and it demonstrates that limited but focused exploration can progressively refine the policy and improve performance. Experiments on Connect Four and additional games show notable Elo gains against simple baselines, and PROBS can outperform deeper-lookahead strategies even when its beam search is shallow; direct comparisons to AlphaZero are left for future work. Overall, PROBS presents a novel framework that couples neural prediction with beam-search planning, offering a potential alternative path for planning in deterministic games and possibly for broader applications.
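The self-play move selection mentioned above (a softmax over action values, perturbed with Dirichlet noise) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `select_action`, the parameters `temperature`, `dirichlet_alpha`, and `noise_weight`, and the AlphaZero-style linear mixing of noise into the softmax probabilities are all assumptions made for the example. It uses only the Python standard library, drawing the Dirichlet sample as normalized Gamma variates.

```python
import math
import random
from typing import Dict, Hashable, Optional, Tuple


def select_action(
    q_values: Dict[Hashable, float],
    temperature: float = 1.0,      # hypothetical softmax temperature
    dirichlet_alpha: float = 0.3,  # hypothetical Dirichlet concentration
    noise_weight: float = 0.25,    # hypothetical mixing weight for the noise
    rng: Optional[random.Random] = None,
) -> Tuple[Hashable, Dict[Hashable, float]]:
    """Sample a move from softmax(Q / temperature) mixed with Dirichlet noise.

    Returns the sampled action and the full mixed probability distribution.
    """
    rng = rng or random.Random()
    actions = list(q_values)

    # Numerically stable softmax over the action values.
    logits = [q_values[a] / temperature for a in actions]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Dirichlet(alpha, ..., alpha) sample via normalized Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(dirichlet_alpha, 1.0) for _ in actions]
    g_sum = sum(gammas)
    if g_sum == 0.0:  # guard against underflow: fall back to uniform noise
        noise = [1.0 / len(actions)] * len(actions)
    else:
        noise = [g / g_sum for g in gammas]

    # Assumed AlphaZero-style linear mix of policy and exploration noise.
    mixed = [(1.0 - noise_weight) * p + noise_weight * n for p, n in zip(probs, noise)]
    action = rng.choices(actions, weights=mixed, k=1)[0]
    return action, dict(zip(actions, mixed))


if __name__ == "__main__":
    move, dist = select_action({"a": 1.0, "b": 0.0, "c": -1.0}, rng=random.Random(0))
    print(move, dist)
```

Setting `noise_weight=0.0` recovers a plain softmax policy, which is the natural choice when evaluating a trained agent rather than generating self-play data.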
Abstract
This paper introduces a novel algorithm for two-player deterministic games with perfect information, which we call PROBS (Predict Results of Beam Search). Unlike existing methods that predominantly rely on Monte Carlo Tree Search (MCTS) for decision processes, our approach leverages a simpler beam search algorithm. We evaluate the performance of our algorithm across a selection of board games, where it consistently demonstrates an increased winning ratio against baseline opponents. A key result of this study is that the PROBS algorithm operates effectively even when the beam search size is considerably smaller than the average number of turns in the game.
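To make the roles of $Q_\phi$ and $V_\theta$ from the TL;DR concrete, here is a minimal sketch of a value-guided, fixed-depth beam search. It is an illustration under stated assumptions, not the paper's exact procedure: the helper names (`beam_search_root_values`, `legal_actions`, `step`, `terminal_value`, `value_net`, `q_net`) are hypothetical, the networks are replaced by plain callables, players are assumed to strictly alternate, and values use the negamax convention (each value is from the perspective of the player to move). At every depth only the `beam_width` children with the highest $Q$ scores are expanded; unexpanded non-terminal nodes are scored by the value function, and the results are backed up by negamax to give one target per root action, the kind of quantity $Q_\phi$ is trained to predict.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

State = Hashable
Action = Hashable


def beam_search_root_values(
    root: State,
    legal_actions: Callable[[State], List[Action]],
    step: Callable[[State, Action], State],
    terminal_value: Callable[[State], Optional[float]],  # None if not terminal
    value_net: Callable[[State], float],                 # stand-in for V_theta
    q_net: Callable[[State, Action], float],             # stand-in for Q_phi
    beam_width: int,
    depth: int,
) -> Dict[Action, float]:
    """Return a backed-up value for every root action (negamax convention).

    Assumes terminal_value(s) is not None whenever legal_actions(s) is empty.
    """
    states: List[State] = [root]                      # node id -> state
    children: Dict[int, List[Tuple[Action, int]]] = {}
    frontier = [0]                                    # node ids to expand

    for _ in range(depth):
        candidates: List[Tuple[float, int]] = []      # (Q score, child id)
        for node in frontier:
            s = states[node]
            if terminal_value(s) is not None:
                continue                              # never expand terminals
            kids = []
            for a in legal_actions(s):
                states.append(step(s, a))
                child = len(states) - 1
                kids.append((a, child))
                candidates.append((q_net(s, a), child))
            children[node] = kids
        # Keep only the best-scoring children for expansion at the next depth.
        candidates.sort(key=lambda t: t[0], reverse=True)
        frontier = [child for _, child in candidates[:beam_width]]
        if not frontier:
            break

    def negamax(node: int) -> float:
        s = states[node]
        tv = terminal_value(s)
        if tv is not None:
            return tv
        if node not in children:
            return value_net(s)                       # unexpanded leaf
        return max(-negamax(c) for _, c in children[node])

    return {a: -negamax(c) for a, c in children.get(0, [])}


# Toy game (assumption, for illustration only): a pile of stones, each move
# removes 1 or 2, and the player who takes the last stone wins.
def nim_actions(n: int) -> List[int]:
    return [k for k in (1, 2) if k <= n]


def nim_step(n: int, k: int) -> int:
    return n - k


def nim_terminal(n: int) -> Optional[float]:
    return -1.0 if n == 0 else None  # player to move with no stones has lost


if __name__ == "__main__":
    zero_q = lambda s, a: 0.0
    # Deep search with an uninformative value function.
    print(beam_search_root_values(4, nim_actions, nim_step, nim_terminal,
                                  lambda s: 0.0, zero_q, beam_width=100, depth=4))
    # Depth-1 search that leans entirely on a (here perfect) value function.
    print(beam_search_root_values(4, nim_actions, nim_step, nim_terminal,
                                  lambda s: -1.0 if s % 3 == 0 else 1.0,
                                  zero_q, beam_width=100, depth=1))
```

Both calls should rank removing one stone (leaving a multiple of three) as winning. The second call shows why a good value function lets a very shallow search still choose well, which is the regime the abstract highlights.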
