Can Large Language Models Play Games? A Case Study of A Self-Play Approach

Hongyi Guo, Zhihan Liu, Yufeng Zhang, Zhaoran Wang

TL;DR

This work addresses decision-making in deterministic turn-based zero-sum games by integrating Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) self-play, without any additional training. The authors formalize the DTZG framework, introduce LLM-driven action pruning and value-function proxying within MCTS, and provide a sublinear suboptimality bound that accounts for the number of simulations and the quality of pruning/criticism. The theoretical analysis, supported by a LogSumExp-based pruning bound, establishes how estimation and pruning errors contribute to overall suboptimality, while experiments in chess puzzles, MiniGo, and full chess games show substantial performance gains over vanilla LLMs and conventional MCTS baselines. The results demonstrate the practical impact of leveraging pre-trained LLMs as planning aids to enhance strategic reasoning in complex games, with implications for broader decision-making tasks requiring strong priors without retraining.
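
As a rough illustration of how a pruner and a critic plug into tree search, the sketch below runs a depth-limited, UCB-guided search on single-pile Nim, a toy deterministic turn-based zero-sum game. It is not the authors' algorithm: `prune_actions` and `estimate_value` are hypothetical hand-written stand-ins for the LLM calls, the game is an assumption chosen for brevity, and the exploration bonus is the standard logarithmic UCB1 term rather than whatever bonus the paper's algorithm uses.

```python
import math

# Toy DTZG (assumption): single-pile Nim. A move removes 1-3 stones and
# the player who takes the last stone wins.
def legal_actions(stones):
    return [a for a in (1, 2, 3) if a <= stones]

def prune_actions(stones, k=2):
    """Hypothetical stand-in for the LLM pruner: keep at most k moves."""
    actions = legal_actions(stones)
    # Hand-written prior: prefer leaving the opponent a multiple of 4.
    actions.sort(key=lambda a: (stones - a) % 4)
    return actions[:k]

def estimate_value(stones):
    """Hypothetical stand-in for the LLM critic.

    Returns a value in [-1, 1] for the player to move.
    """
    return -1.0 if stones % 4 == 0 else 1.0

class Node:
    def __init__(self, stones):
        self.stones = stones
        self.visits = 0
        self.total = 0.0     # summed returns for the player to move here
        self.children = {}   # action -> Node, over the pruned action set

def simulate(node, depth, c=1.4):
    """Run one simulation and return the value for the player to move."""
    if node.stones == 0:
        value = -1.0         # the opponent just took the last stone
    elif depth == 0:
        value = estimate_value(node.stones)   # critic replaces a rollout
    else:
        if not node.children:
            node.children = {
                a: Node(node.stones - a) for a in prune_actions(node.stones)
            }

        def ucb(child):
            if child.visits == 0:
                return math.inf
            # The child's total is from the opponent's view, so negate it.
            mean = -child.total / child.visits
            bonus = c * math.sqrt(math.log(node.visits + 1) / child.visits)
            return mean + bonus

        action = max(node.children, key=lambda a: ucb(node.children[a]))
        value = -simulate(node.children[action], depth - 1, c)
    node.visits += 1
    node.total += value
    return value

def search(stones, n_sim=200, depth=4):
    """Return (most-visited root action, mean root value estimate)."""
    root = Node(stones)
    for _ in range(n_sim):
        simulate(root, depth)
    best = max(root.children, key=lambda a: root.children[a].visits)
    return best, root.total / root.visits

if __name__ == "__main__":
    print(search(5))             # search from a pile of 5 stones
    print(search(10, depth=3))   # shallower search that relies on the critic
```

Each node expands only the pruned action set, so the branching factor is at most `k`, and leaves at the depth limit are scored by the critic instead of a random rollout. Those are the two places where an LLM would be queried in a setup of this kind.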

Abstract

Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge. While LLMs have proven beneficial as decision-making aids, their reliability is hampered by limitations in reasoning, hallucination, and related failure modes. Monte Carlo Tree Search (MCTS), on the other hand, is a heuristic search algorithm that provides reliable decision-making solutions through recursive rollouts and self-play. However, the effectiveness of MCTS relies heavily on heuristic pruning and external value functions, particularly in complex decision scenarios. This work introduces an approach that bolsters LLMs with MCTS self-play to efficiently solve deterministic turn-based zero-sum games (DTZG), such as chess and Go, without the need for additional training. Specifically, we use LLMs as both action pruners and proxies for value functions. We theoretically prove that the suboptimality of the estimated value in our proposed method scales as $\tilde{\mathcal O}\Bigl(\frac{|\tilde{\mathcal A}|}{\sqrt{N}} + \varepsilon_\mathrm{pruner} + \varepsilon_\mathrm{critic}\Bigr)$, where $N$ is the number of simulations, $|\tilde{\mathcal A}|$ is the cardinality of the action space after pruning by the LLM, and $\varepsilon_\mathrm{pruner}$ and $\varepsilon_\mathrm{critic}$ quantify the errors incurred by adopting LLMs as the action space pruner and the value function proxy, respectively. Our experiments in chess and Go demonstrate that our method can address challenges beyond the scope of MCTS and improves on the performance of applying LLMs directly.
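
To get a feel for the rate stated in the abstract, the snippet below evaluates the expression $\frac{|\tilde{\mathcal A}|}{\sqrt{N}} + \varepsilon_\mathrm{pruner} + \varepsilon_\mathrm{critic}$ for a few values of $N$. This is only an arithmetic illustration: the function name `bound_scale` and all numbers are hypothetical, and the $\tilde{\mathcal O}$ notation hides constants and logarithmic factors that are ignored here.

```python
import math

def bound_scale(n_sim, n_pruned_actions, eps_pruner, eps_critic):
    """Evaluate |A~|/sqrt(N) + eps_pruner + eps_critic.

    Constants and log factors hidden by the O-tilde are ignored, so the
    result shows how the terms trade off, not an actual error value.
    """
    return n_pruned_actions / math.sqrt(n_sim) + eps_pruner + eps_critic

if __name__ == "__main__":
    # Hypothetical settings: 5 actions kept by the pruner, small LLM errors.
    for n_sim in (100, 400, 1600):
        print(n_sim, bound_scale(n_sim, 5, 0.10, 0.05))
```

Quadrupling $N$ halves the search term $|\tilde{\mathcal A}|/\sqrt{N}$, while the two $\varepsilon$ terms stay fixed. Under this expression, extra simulations stop helping once the search term falls below the pruner and critic errors, and pruning more aggressively shrinks $|\tilde{\mathcal A}|$ at the possible cost of a larger $\varepsilon_\mathrm{pruner}$.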

Paper Structure (27 sections, 13 theorems, 63 equations, 1 figure, 5 tables, 3 algorithms)

Key Result

Theorem 5.2

Set $\eta_1 = 1/4$ and $\eta_2 = 1/2$ in Algorithm alg:mcts. For any $s_0 \in \mathcal{S}$, there exist constants $\{\beta_h\}_{h \in \mathcal{H}}$ such that […], where $\gamma$ is the discount factor, $H$ is the search depth, $N$ is the number of simulations in Algorithm alg:mcts, $\widetilde{\mathcal{A}}$ is the pruned action space, $\varepsilon_0 = \|\widehat{V} - V^\star\|_\infty$ is the est

Figures (1)

  • Figure 1: Algorithm illustration for chess. We assume black nodes correspond to states where the black player has just made a move, and vice versa for the white nodes.

Theorems & Definitions (14)

  • Theorem 5.2: Estimation Error
  • Lemma 5.3
  • Definition 5.4: Quality of LLMs as pruner
  • Proposition 5.5
  • Corollary 5.6
  • Theorem A.3: Theorem 3, shah2020non
  • Corollary A.4: UCB Min Player
  • Lemma A.5: Leaf
  • Lemma A.6: Max player
  • Lemma A.7: Min player
  • ...and 4 more