Sample-Efficient Regret-Minimizing Double Oracle in Extensive-Form Games

Xiaohang Tang, Chiyuan Wang, Chengdong Ma, Ilija Bogunovic, Stephen McAleer, Yaodong Yang

TL;DR

Adaptive Double Oracle (AdaDO) reduces sample complexity to polynomial by deploying the optimal expansion frequency. Combining RMDO with warm starting and stochastic regret minimization further improves convergence rate and scalability, paving the way for addressing complex multi-agent tasks.

Abstract

The Extensive-Form Game (EFG) is a fundamental model for analyzing sequential interactions among multiple agents, and the primary challenge in solving it lies in mitigating sample complexity. Existing research indicates that Double Oracle (DO) can reduce the dependence of sample complexity from the number of information sets $|S|$ to the final restricted game size $X$ when solving EFGs. This is attributed to early convergence to a full-game Nash Equilibrium (NE) through iteratively solving restricted games. However, we prove that the state-of-the-art Extensive-Form Double Oracle (XDO) exhibits sample complexity exponential in $X$, due to its exponentially increasing restricted game expansion frequency. Here we introduce Adaptive Double Oracle (AdaDO), which reduces the sample complexity to polynomial by deploying the optimal expansion frequency. Furthermore, to comprehensively study the principles and influencing factors underlying sample complexity, we introduce a novel theoretical framework, Regret-Minimizing Double Oracle (RMDO), to provide directions for designing efficient DO algorithms. Empirical results demonstrate that AdaDO attains a better approximation of NE with lower sample complexity than strong baselines, including Linear CFR, MCCFR, and existing DO methods. Importantly, combining RMDO with warm starting and stochastic regret minimization further improves convergence rate and scalability, thereby paving the way for addressing complex multi-agent tasks.

Paper Structure

This paper contains 31 sections, 16 theorems, 43 equations, 7 figures, 4 tables, 2 algorithms.

Key Result

Lemma 3

$\min_{\pi\in \Pi^*}\max_{s\in S}\text{supp}^{\pi}(s) < k \leq X \leq |S| =\sum_i |S_i|$, where $k$ is the number of restricted games constructed over the whole run of Double Oracle, $X$ is the final restricted game size, and $|S|$ is the number of information sets.

Figures (7)

  • Figure 1: Flow chart of existing Double Oracle algorithms, namely Double Oracle (DO), Extensive-form Double Oracle (XDO), and Online Double Oracle (ODO), together with our proposed method, Regret-Minimizing Double Oracle (RMDO).
  • Figure 2: Restricted game expansion and warm starting of Regret-Minimizing Double Oracle in EFGs. In the restricted game, the regret minimizer keeps updating the regret and average strategy. After $m(\cdot)$ iterations of regret minimization, we compute best response actions (BR) against the restricted strategy (violet, orange and blue bars). If there are new BR actions, we expand the restricted game with them.
  • Figure 3: Exploitability versus visited nodes for Extensive-form Double Oracle (XDO), Periodic Double Oracle with periodicity $c$, i.e. PDO($c$), Extensive-form Online Double Oracle (XODO), Extensive-form Fictitious Self-Play (XFP), and Linear Counterfactual Regret Minimization (LCFR). Our algorithm PDO achieves lower exploitability than any other method.
  • Figure 4: Exploitability versus visited nodes for PDO with and without warm starting, AdaDO with and without warm starting, and LCFR. Warm starting helps reduce exploitability significantly in Blotto and Large Kuhn Poker. AdaDO outperforms LCFR and PDO in most games.
  • Figure 5: Exploitability experiments for Stochastic PDO (SPDO) and Stochastic Adaptive DO (SADO), each with and without warm starting, and Outcome-Sampling Monte-Carlo CFR (MCCFR). SPDO and SADO perform similarly well and significantly outperform MCCFR.
  • ...and 2 more figures
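
The loop described in the Figure 2 caption (run the regret minimizer in the restricted game, compute best responses after a fixed number of iterations, expand if a new best response appears) can be illustrated with a minimal sketch. It is not the paper's implementation, and it makes several simplifying assumptions: a zero-sum matrix game stands in for an EFG, plain regret matching stands in for the restricted-game regret minimizer, the frequency function is a constant `period` (in the spirit of PDO($c$)), and regrets are reset on every expansion, i.e. no warm starting. All function and parameter names are hypothetical.

```python
def regret_matching(regrets):
    """Play proportionally to positive regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def exploitability(payoff, p, q):
    """Sum of both players' best-response gains in a zero-sum matrix game."""
    row_values = [sum(a * qj for a, qj in zip(row, q)) for row in payoff]
    col_values = [sum(payoff[i][j] * p[i] for i in range(len(p)))
                  for j in range(len(q))]
    return max(row_values) - min(col_values)


def periodic_double_oracle(payoff, period=10, total_iters=2000):
    """payoff[i][j] is the row player's payoff; the column player gets its negative.

    Assumption: a constant period plays the role of the frequency function,
    and the regret minimizer restarts after each expansion (no warm start).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    rows, cols = [0], [0]  # restricted action sets, started arbitrarily

    def fresh():
        return [0.0] * n_rows, [0.0] * n_cols, [0.0] * n_rows, [0.0] * n_cols

    reg_r, reg_c, sum_r, sum_c = fresh()
    for t in range(total_iters):
        if t > 0 and t % period == 0:
            # Best responses in the FULL game against the restricted average strategy.
            avg_p = [s / sum(sum_r) for s in sum_r]
            avg_q = [s / sum(sum_c) for s in sum_c]
            br_row = max(range(n_rows), key=lambda i: sum(
                payoff[i][j] * avg_q[j] for j in range(n_cols)))
            br_col = min(range(n_cols), key=lambda j: sum(
                payoff[i][j] * avg_p[i] for i in range(n_rows)))
            expanded = False
            if br_row not in rows:
                rows.append(br_row)
                expanded = True
            if br_col not in cols:
                cols.append(br_col)
                expanded = True
            if expanded:
                reg_r, reg_c, sum_r, sum_c = fresh()

        # One regret-matching iteration inside the restricted game.
        p = regret_matching([reg_r[i] for i in rows])
        q = regret_matching([reg_c[j] for j in cols])
        u_r = [sum(payoff[i][j] * qj for j, qj in zip(cols, q)) for i in rows]
        u_c = [-sum(payoff[i][j] * pi for i, pi in zip(rows, p)) for j in cols]
        v_r = sum(pi * u for pi, u in zip(p, u_r))
        v_c = sum(qj * u for qj, u in zip(q, u_c))
        for i, pi, u in zip(rows, p, u_r):
            reg_r[i] += u - v_r
            sum_r[i] += pi
        for j, qj, u in zip(cols, q, u_c):
            reg_c[j] += u - v_c
            sum_c[j] += qj

    avg_p = [s / sum(sum_r) for s in sum_r]
    avg_q = [s / sum(sum_c) for s in sum_c]
    return avg_p, avg_q, rows, cols


if __name__ == "__main__":
    rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
    p, q, rows, cols = periodic_double_oracle(rps)
    print("restricted sets:", rows, cols)
    print("exploitability:", exploitability(rps, p, q))
```

In a game whose equilibrium has small support, the restricted sets in this sketch stop growing early, which is the intuition behind replacing the dependence on $|S|$ with the final restricted game size $X$. How often to check for best responses is exactly the expansion-frequency choice that the paper analyzes; the constant period here is only one option.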

Theorems & Definitions (19)

  • Definition 1: Frequency Function
  • Definition 2
  • Lemma 3
  • Theorem 4
  • Proposition 5
  • Proposition 6
  • Proposition 7
  • Definition 8: Adaptive Frequency Function
  • Proposition 9
  • Theorem 10
  • ...and 9 more