Superhuman AI for Stratego Using Self-Play Reinforcement Learning and Test-Time Search

Samuel Sokota, Eugene Vinitsky, Hengyuan Hu, J. Zico Kolter, Gabriele Farina

TL;DR

This paper addresses the challenge of achieving superhuman AI in Stratego, a highly imperfect-information game with vast hidden state. It introduces Ataraxos, a framework that combines tabula rasa self-play reinforcement learning with test-time search, implemented via two interdependent Transformer-based networks for setups and moves and a belief network to model hidden pieces. The approach employs dynamically damped learning and a test-time search procedure based on update-equivalence to refine policies under uncertainty. Empirical results show Ataraxos beating the best human Stratego player with an 85% effective win rate in a 20-game series ($p<0.00026$) and achieving a 95% effective win rate in a 40-game exhibition, indicating that large hidden-information problems can be addressed efficiently with modern RL and search techniques.

Abstract

Few classical games have been regarded as such significant benchmarks of artificial intelligence as to have justified training costs in the millions of dollars. Among these, Stratego -- a board wargame exemplifying the challenge of strategic decision making under massive amounts of hidden information -- stands apart as a case where such efforts failed to produce performance at the level of top humans. This work establishes a step change in both performance and cost for Stratego, showing that it is now possible not only to reach the level of top humans, but to achieve vastly superhuman level -- and that doing so requires not an industrial budget, but merely a few thousand dollars. We achieved this result by developing general approaches for self-play reinforcement learning and test-time search under imperfect information.

Paper Structure

This paper contains 55 sections, 11 equations.