Look-ahead Search on Top of Policy Networks in Imperfect Information Games

Ondrej Kubicek, Neil Burch, Viliam Lisy

TL;DR

This paper addresses the exploitability of policy-gradient agents in imperfect-information games by adding safe test-time search that requires no search during training. It introduces SePoT, which pairs a policy-gradient learner with a history critic that estimates the expected values of players following various transformations of the learned policy, using V-trace-based off-policy estimates. These values drive depth-limited search via gadget games. The approach improves on Regularized Nash Dynamics in Goofspiel, Battleships, and Leduc hold'em, and reduces exploitability in smaller games, while remaining scalable because search is run only in tractable subgames. The work offers a practical framework for integrating search with large-scale imperfect-information policies, combining value-function estimation with local look-ahead.

Abstract

Search in test time is often used to improve the performance of reinforcement learning algorithms. Performing theoretically sound search in fully adversarial two-player games with imperfect information is notoriously difficult and requires a complicated training process. We present a method for adding test-time search to an arbitrary policy-gradient algorithm that learns from sampled trajectories. Besides the policy network, the algorithm trains an additional critic network, which estimates the expected values of players following various transformations of the policies given by the policy network. These values are then used for depth-limited search. We show how the values from this critic can create a value function for imperfect information games. Moreover, they can be used to compute the summary statistics necessary to start the search from an arbitrary decision point in the game. The presented algorithm is scalable to very large games since it does not require any search during train time. We evaluate the algorithm's performance when trained along Regularized Nash Dynamics, and we evaluate the benefit of using the search in the standard benchmark game of Leduc hold'em, multiple variants of imperfect information Goofspiel, and Battleships.

Paper Structure

This paper contains 25 sections, 1 theorem, 10 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

(Inspired by V-trace): Let $\rho^t = \min\left(\overline{\rho}, \frac{\pi(s_i(h^t), a_i^t)}{\mu(s_i(h^t), a_i^t)}\right)$, $c^t = \min\left(\overline{c}, \frac{\pi(s_i(h^t), a_i^t)}{\mu(s_i(h^t), a_i^t)}\right)$, $\overline{\rho} \geq \overline{c} \geq 1$, and $\mu(s, a) > 0 \;\; \forall s, a$. Let us assume there exis...
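
As a rough illustration of the truncated importance weights defined in the theorem statement, the sketch below computes $\rho^t$ and $c^t$ from per-step action probabilities under the target policy $\pi$ and the behavior policy $\mu$. The function name `clipped_weights` and its list-based interface are assumptions made for this example and do not come from the paper. It covers only the two weight definitions, not the remainder of the theorem, which is cut off in this listing.

```python
def clipped_weights(pi_probs, mu_probs, rho_bar=1.0, c_bar=1.0):
    """Return per-step truncated importance weights (rhos, cs).

    pi_probs[t] and mu_probs[t] are the probabilities that the target
    policy pi and the behavior policy mu assign to the action actually
    taken at step t. This helper is hypothetical, for illustration only.
    """
    # The theorem statement requires rho_bar >= c_bar >= 1.
    if not (rho_bar >= c_bar >= 1.0):
        raise ValueError("expected rho_bar >= c_bar >= 1")
    if len(pi_probs) != len(mu_probs):
        raise ValueError("pi_probs and mu_probs must have equal length")

    rhos, cs = [], []
    for p, m in zip(pi_probs, mu_probs):
        # The behavior policy must give positive probability to every action.
        if m <= 0.0:
            raise ValueError("mu(s, a) must be strictly positive")
        ratio = p / m                     # plain importance ratio pi / mu
        rhos.append(min(rho_bar, ratio))  # rho^t = min(rho_bar, pi / mu)
        cs.append(min(c_bar, ratio))      # c^t   = min(c_bar,   pi / mu)
    return rhos, cs


if __name__ == "__main__":
    rhos, cs = clipped_weights(
        [0.5, 0.75, 0.125], [0.25, 0.25, 0.5], rho_bar=2.5, c_bar=1.0
    )
    print(rhos)  # [2.0, 2.5, 0.25]
    print(cs)    # [1.0, 1.0, 0.25]
```

In this example the raw ratios are 2.0, 3.0, and 0.25. Only ratios above a cap are clipped, so a ratio below 1 passes through unchanged under both caps.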

Figures (6)

  • Figure 1: Current state in a smaller version of Battleships and all possible states sharing the same public information
  • Figure 2: Subgame with 3 leaf public states. Right part shows the detail of public state $s_0^1$ with 4 histories and 2 information sets for each player. The multi-valued states technique modifies this public state by giving player 2 a choice between two strategies in each history against blueprint policy of player 1.
  • Figure 3: Exploitability of policy network RNaD and the search with SePoT based on RNaD training iterations.
  • Figure 4: Exploitability based on the training iterations, when using the counterfactual values from previous searches or computing them from history critic.
  • Figure 5: Exploitability in Goofspiel based on the training iterations, comparing predefined transformations with the neural network transformations
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1
  • Theorem 1