Efficiently Training Neural Networks for Imperfect Information Games by Sampling Information Sets

Timo Bertram, Johannes Fürnkranz, Martin Müller

TL;DR

The paper addresses valuing imperfect-information states by averaging over information-set configurations under a fixed budget of perfect-information evaluations. It studies how to allocate evaluations between the number of training examples $n$ and the number of evaluations per example $k$, which determines target quality, formalizing targets as $\hat{y}=\sum_{\mathbf{h}} P(\mathbf{h}|\mathbf{x}) f(\mathbf{x},\mathbf{h})$ with a total budget of $N=nk$. Empirical results in Heads-Up Poker and Reconnaissance Blind Chess show that distributing evaluations across many samples with moderate $k$ (notably $k$ around 2–10) generally yields better learning efficiency and accuracy than high-quality targets from few samples. The findings provide practical guidance for training imperfect-information evaluators under computational budgets and suggest that the optimal sampling strategy may extend to other domains with information-set evaluations.
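
One way to read the $N=nk$ trade-off is as a Monte Carlo approximation of the target. The following is a sketch under an assumption not stated in the summary: the $k$ hidden states $\mathbf{h}_1,\dots,\mathbf{h}_k$ are drawn independently from $P(\mathbf{h}|\mathbf{x})$. The symbols $\hat{y}_k$ and $\sigma^2(\mathbf{x})$ are introduced here for illustration only.

$$\hat{y}_k = \frac{1}{k}\sum_{i=1}^{k} f(\mathbf{x},\mathbf{h}_i), \qquad \mathbb{E}[\hat{y}_k] = \hat{y}, \qquad \operatorname{Var}[\hat{y}_k] = \frac{\sigma^2(\mathbf{x})}{k}, \qquad \sigma^2(\mathbf{x}) = \operatorname{Var}_{\mathbf{h}\sim P(\cdot|\mathbf{x})}\big[f(\mathbf{x},\mathbf{h})\big]$$

Under this reading, doubling $k$ halves the variance of each target but, with $N$ fixed, also halves the number of distinct training examples $n = N/k$.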

Abstract

In imperfect information games, the evaluation of a game state not only depends on the observable world but also relies on hidden parts of the environment. As accessing the obstructed information trivialises state evaluations, one approach to tackle such problems is to estimate the value of the imperfect state as a combination of all states in the information set, i.e., all possible states that are consistent with the current imperfect information. In this work, the goal is to learn a function that maps from the imperfect game information state to its expected value. However, constructing a perfect training set, i.e., an enumeration of the whole information set for numerous imperfect states, is often infeasible. To compute the expected values for an imperfect information game like Reconnaissance Blind Chess, one would need to evaluate thousands of chess positions just to obtain the training target for a single state. Still, the expected value of a state can already be approximated with appropriate accuracy from a much smaller set of evaluations. Thus, in this paper, we empirically investigate how a budget of perfect information game evaluations should be distributed among training samples to maximise the return. Our results show that sampling a small number of states, in our experiments roughly 3, for a larger number of separate positions is preferable over repeatedly sampling a smaller quantity of states. Thus, we find that in our case, the quantity of different samples seems to be more important than higher target quality.
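
The budget split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`sample_target`, `build_training_set`) and the toy coin-flip game are hypothetical stand-ins, chosen so that the true expected value of each observation is known and the noise in the sampled targets can be measured directly.

```python
import random
import statistics


def sample_target(x, k, sample_hidden, evaluate, rng):
    """Monte Carlo target: mean of k perfect-information evaluations for x."""
    values = [evaluate(x, sample_hidden(x, rng)) for _ in range(k)]
    return statistics.fmean(values)


def build_training_set(observations, budget, k, sample_hidden, evaluate, rng):
    """Spend `budget` evaluations on n = budget // k examples, k samples each."""
    n = budget // k
    examples = []
    for _ in range(n):
        x = rng.choice(observations)
        examples.append((x, sample_target(x, k, sample_hidden, evaluate, rng)))
    return examples


# Toy stand-in (hypothetical, not from the paper): the observation x is itself
# the win probability, the hidden state is a coin flip with that bias, and the
# perfect-information evaluation is the outcome (1.0 for a win, 0.0 for a loss).
def toy_sample_hidden(x, rng):
    return rng.random() < x


def toy_evaluate(x, hidden):
    return 1.0 if hidden else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    observations = [i / 10 for i in range(1, 10)]
    budget = 12_000  # total number of perfect-information evaluations, N
    for k in (1, 3, 10, 100):
        data = build_training_set(observations, budget, k,
                                  toy_sample_hidden, toy_evaluate, rng)
        # Mean squared error of the sampled targets against the true value x.
        mse = statistics.fmean([(y - x) ** 2 for x, y in data])
        print(f"k={k:>3}  n={len(data):>5}  target MSE={mse:.4f}")
```

Running the script prints, for each $k$, how many examples the fixed budget buys and how noisy their targets are. Whether many noisy targets or few accurate ones train a better network is the empirical question the paper investigates; this sketch only makes the two sides of the trade-off visible.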


Paper Structure

This paper contains 12 sections, 1 equation, and 8 figures.

Figures (8)

  • Figure 1: Estimating the value of an imperfect information position (left) as the average of the perfect information evaluations of all positions in the information set.
  • Figure 2: Learning an imperfect information evaluation function from $n$ examples, for which the target evaluation is estimated from $k$ positions, using a constant budget of $N = n \cdot k$ perfect information evaluations.
  • Figure 3: Monte-Carlo estimate of the error in evaluating the chance of winning with a given hand pre-flop in heads-up poker. Estimations are computed by averaging over $n$ samples of possible opponent hands and rivers.
  • Figure 4: Average training curves of learning to evaluate a poker hand with different numbers of evaluations per training example. The $x$-axis is logarithmically scaled either by the total number of hand evaluations (top) or by the total number of update steps made (bottom).
  • Figure 5: Average lowest received error for the different numbers of hand evaluations per training example, given a total budget of either 100M hand evaluations or 1M training updates.
  • ...and 3 more figures