Reevaluating Policy Gradient Methods for Imperfect-Information Games

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota

Abstract

In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results for the magnetic mirror descent (MMD) algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Across more than 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel.

Paper Structure

This paper contains 54 sections, 2 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Tic-Tac-Toe (left) and 3x3 Hex (right).
  • Figure 2: Exploitability results. For each combination of game and algorithm, the box-and-whisker pair depicts the distribution of final exploitability over the runs from the hyperparameter tuning launch (left) and the evaluation launch (right), plotted on a square-root y-axis scale. R-NaD, NFSP, ESCHER, and PSRO failed to outperform generic PG methods (MMD, PPO, PPG).
  • Figure 3: Head-to-head evaluations. The number in each cell is the expected return of the row algorithm against the column algorithm when each plays half of the games as the first-moving player. R-NaD, NFSP, ESCHER, and PSRO failed to outperform generic PG methods (MMD, PPO, PPG), which are separated from the other algorithms by the dashed red lines.
  • Figure 4: Performance of various tabular methods.
  • Figure 5: A deterministic strategy for player 1 that always wins. The gray dashed lines denote a hidden action of the blue player.
  • ...and 10 more figures