The Update-Equivalence Framework for Decision-Time Planning

Samuel Sokota, Gabriele Farina, David J. Wu, Hengyuan Hu, Kevin A. Wang, J. Zico Kolter, Noam Brown

TL;DR

This work introduces an alternative framework for decision-time planning that is not based on solving subgames, but rather on update equivalence, and derives a provably sound search algorithm for fully cooperative games based on mirror descent and a search algorithm for adversarial games based on magnetic mirror descent.

Abstract

The process of revising (or constructing) a policy at execution time -- known as decision-time planning -- has been key to achieving superhuman performance in perfect-information games like chess and Go. A recent line of work has extended decision-time planning to imperfect-information games, leading to superhuman performance in poker. However, these methods involve solving subgames whose sizes grow quickly in the amount of non-public information, making them unhelpful when the amount of non-public information is large. Motivated by this issue, we introduce an alternative framework for decision-time planning that is not based on solving subgames, but rather on update equivalence. In this update-equivalence framework, decision-time planning algorithms replicate the updates of last-iterate algorithms, which need not rely on public information. This facilitates scalability to games with large amounts of non-public information. Using this framework, we derive a provably sound search algorithm for fully cooperative games based on mirror descent and a search algorithm for adversarial games based on magnetic mirror descent. We validate the performance of these algorithms in cooperative and adversarial domains, notably in Hanabi, the standard benchmark for search in fully cooperative imperfect-information games. Here, our mirror descent approach exceeds or matches the performance of public information-based search while using two orders of magnitude less search time. This is the first instance of a non-public-information-based algorithm outperforming public-information-based approaches in a domain they have historically dominated.
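To make the two update rules named in the abstract concrete, here is a minimal sketch of a single mirror descent step and a single magnetic mirror descent step at one decision point, driven by action-value feedback. It assumes the negative-entropy mirror map, under which both updates have a closed form. The function names, the example numbers, and the choice of regularizer are illustrative assumptions, not details taken from this page.

```python
import math


def _normalize_from_logits(logits):
    """Softmax with max-subtraction for numerical stability."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mirror_descent_step(policy, q_values, stepsize):
    """Entropic mirror descent: new(a) is proportional to policy(a) * exp(stepsize * q(a)).

    Assumes every probability in `policy` is strictly positive.
    """
    logits = [math.log(p) + stepsize * q for p, q in zip(policy, q_values)]
    return _normalize_from_logits(logits)


def magnetic_mirror_descent_step(policy, magnet, q_values, stepsize, alpha):
    """Entropic magnetic mirror descent (assumed closed form).

    new(a) is proportional to
    [policy(a) * magnet(a)**(stepsize*alpha) * exp(stepsize*q(a))] ** (1/(1+stepsize*alpha)).
    `alpha` controls how strongly the update is pulled toward `magnet`.
    """
    scale = 1.0 / (1.0 + stepsize * alpha)
    logits = [
        scale * (math.log(p) + stepsize * alpha * math.log(r) + stepsize * q)
        for p, r, q in zip(policy, magnet, q_values)
    ]
    return _normalize_from_logits(logits)


# Hypothetical inputs for illustration only.
policy = [1 / 3, 1 / 3, 1 / 3]
magnet = [0.5, 0.3, 0.2]
q_values = [0.0, 0.5, 1.0]

md = mirror_descent_step(policy, q_values, stepsize=1.0)
mmd_no_magnet = magnetic_mirror_descent_step(policy, magnet, q_values, 1.0, alpha=0.0)
mmd_strong_magnet = magnetic_mirror_descent_step(policy, magnet, q_values, 1.0, alpha=1e6)
```

With `alpha=0.0` the magnetic update reduces to the plain mirror descent step, and as `alpha` grows the result approaches the magnet policy, which is the sense in which the magnet regularizes the update.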

Paper Structure

This paper contains 23 sections, 3 theorems, 15 equations, 8 figures, 5 tables, and 1 algorithm.

Key Result

Proposition 3.2

Consider a last-iterate algorithm operating with action-value feedback whose update function $\mathcal{U}$ is continuous. Then, as the number of rollouts goes to infinity, the output of Algorithm 1, conditioned on $\mathcal{U}$ and given inputs $h_i^t, \pi$, converges in probability to $\mathc
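The intuition behind this kind of statement can be sketched in a few lines: estimate each action value by averaging rollouts, then apply the continuous update function to the estimates. The sketch below is not the paper's algorithm. The `sample_return` callback is a hypothetical stand-in for sampling a history consistent with the current information state and rolling out the joint policy, and the entropic mirror descent update is an assumed choice of $\mathcal{U}$.

```python
import math
import random


def softmax_update(policy, q_values, stepsize=1.0):
    """Assumed update function U: an entropic mirror descent step.

    Requires strictly positive probabilities in `policy`.
    """
    logits = [math.log(p) + stepsize * q for p, q in zip(policy, q_values)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def search_step(policy, sample_return, num_rollouts, rng, update=softmax_update):
    """Monte Carlo search step: average rollouts per action, then apply U."""
    q_hat = []
    for action in range(len(policy)):
        returns = [sample_return(action, rng) for _ in range(num_rollouts)]
        q_hat.append(sum(returns) / num_rollouts)
    return update(policy, q_hat)


# Hypothetical toy problem: true action values plus unit-variance rollout noise.
TRUE_Q = [0.2, 0.5, 0.9]


def noisy_return(action, rng):
    return TRUE_Q[action] + rng.gauss(0.0, 1.0)


prior = [1 / 3, 1 / 3, 1 / 3]
exact = softmax_update(prior, TRUE_Q)
approx = search_step(prior, noisy_return, num_rollouts=20000, rng=random.Random(0))
gap = max(abs(a - b) for a, b in zip(exact, approx))
```

Because the averaged returns concentrate around the true action values and the update is continuous, `gap` shrinks as `num_rollouts` grows, which mirrors the convergence-in-probability claim at toy scale.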

Figures (8)

  • Figure 1: Divergence to AQRE as a function of iterations for last-iterate algorithm analogues of variants of MMD-based search algorithms. Each variant exhibits empirical convergence.
  • Figure 2: Performance of MDS as a function of step size in 7-card 4-hint Hanabi. Small step sizes provide less improvement over the blueprint; overly large step sizes cause the search policy to diverge too far from the blueprint, resulting in less improvement or even detriment.
  • Figure 3: Expected return of uniform random and uniform random blueprint + MMDS (left) and MMD(1M) and MMD(1M) blueprint + MMDS (right) versus various opponents in 3x3 Abrupt Dark Hex and Phantom Tic-Tac-Toe. MMDS tends to improve head-to-head expected return.
  • Figure 4: Solving for agent quantal response equilibria using last-iterate algorithm analogues of variants of MMDS.
  • Figure 5: Solving for MiniMaxEnt equilibria using last-iterate algorithm analogues of variants of MMDS.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Definition 3.1: Update equivalence
  • Proposition 3.2
  • proof
  • Theorem 3.3
  • Lemma A.1: Folklore
  • proof
  • proof