Conflux-PSRO: Effectively Leveraging Collective Advantages in Policy Space Response Oracles

Yucong Huang, Jiesong Lian, Mingzhi Wang, Chengdong Ma, Ying Wen

TL;DR

This paper proposes Conflux-PSRO, which exploits the diversity of the policy population by adaptively selecting and training policies at the state level. Compared with existing methods, it significantly improves the utility of best responses (BRs) and reduces exploitability.

Abstract

Policy Space Response Oracles (PSRO) with policy population construction has been demonstrated as an effective method for approximating Nash Equilibrium (NE) in zero-sum games. Existing studies have attempted to improve diversity in policy space, primarily by incorporating diversity regularization into the Best Response (BR). However, these methods cause the BR to deviate from maximizing rewards, which easily results in a population that favors diversity over performance, even where diversity is not needed. Consequently, exploitability is difficult to reduce until the policy space is fully explored, especially in complex games. In this paper, we propose Conflux-PSRO, which fully exploits the diversity of the population by adaptively selecting and training policies at the state level. Specifically, Conflux-PSRO identifies useful policies from the existing population and employs a routing policy to select the most appropriate policy at each decision point, while simultaneously training the selected policies to enhance their effectiveness. Compared to the single-policy BR of traditional PSRO and its diversity-improved variants, the BR generated by Conflux-PSRO not only leverages the specialized expertise of diverse policies but also synergistically enhances overall performance. Our experiments on various environments demonstrate that Conflux-PSRO significantly improves the utility of BRs and reduces exploitability compared to existing methods.
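The state-level routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `RoutedPolicy`, the `router_logits` function, and the integer state encoding are all hypothetical, and the sketch covers only the selection step, not the joint training of the router and the selected sub-policies.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

State = int  # assumption: states are encoded as integers for illustration
Policy = Callable[[State], List[float]]  # state -> action probabilities


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class RoutedPolicy:
    """Hypothetical wrapper: a router picks one sub-policy per state."""

    def __init__(self, sub_policies: Sequence[Policy],
                 router_logits: Callable[[State], List[float]]) -> None:
        self.sub_policies = list(sub_policies)
        # router_logits(state) returns one logit per sub-policy (assumed form).
        self.router_logits = router_logits

    def routing_probs(self, state: State) -> List[float]:
        """Distribution over sub-policies for the given state."""
        return softmax(self.router_logits(state))

    def act(self, state: State, rng: random.Random) -> Tuple[int, int]:
        """Sample a sub-policy index, then an action from that sub-policy."""
        probs = self.routing_probs(state)
        k = rng.choices(range(len(self.sub_policies)), weights=probs)[0]
        action_probs = self.sub_policies[k](state)
        action = rng.choices(range(len(action_probs)), weights=action_probs)[0]
        return k, action


# Usage: two fixed sub-policies; the router prefers the second one in early
# states and switches to the first one from state 3 onward.
pi_a: Policy = lambda s: [1.0, 0.0]
pi_b: Policy = lambda s: [0.0, 1.0]
router = lambda s: [50.0, -50.0] if s >= 3 else [-50.0, 50.0]
routed = RoutedPolicy([pi_a, pi_b], router)
```

In a full method the router's logits would be learned from reward rather than hand-written as they are here; the hand-written switch only mimics the kind of state-dependent handover between policies that the abstract describes.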

Paper Structure

This paper contains 18 sections, 11 equations, 7 figures, 3 tables, 3 algorithms.

Figures (7)

  • Figure 1: Maze Game: A 7x4 grid-based, two-player, zero-sum game between a human (red) trying to reach the shelter (black) and a monster (yellow) attempting to capture the human. The human moves one cell per step, while the monster can move up to two cells in the same direction. The human explores three policies: $\pi_1$ (purple), $\pi_2$ (blue), and $\pi_3$ (green). $\pi_1$ and $\pi_2$ are more effective than $\pi_3$, but each leads to eventual danger if followed strictly. By switching between useful policies (starting with $\pi_2$ and switching to $\pi_1$ at critical points), the human can win. This demonstrates how combining the strengths of historical policies creates a more robust policy $\pi_*$ (red), accelerating exploration and policy improvement.
  • Figure 2: Construction of Routing-policy: A routing-policy serves as a decision-making layer that adaptively selects the most appropriate sub-policy for any given state, thereby optimizing the overall system's performance.
  • Figure 3: Exploitability on Leduc Poker and Goofspiel with 2e5 and 3e5 episodes for training each BR, respectively.
  • Figure 4: BR utility generated by PSD-PSRO, Conflux-PSRO, and Conflux-PSRO without distillation after one training iteration.
  • Figure 5: Exploitability on Liar's Dice and Liar's Dice IR with 2e5 episodes for training each BR.
  • ...and 2 more figures
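
The exploitability reported in the figure captions above can be made concrete on a small matrix game. The sketch below uses the common two-player zero-sum definition (the sum of both players' best-response gains, divided by two); this normalization is an assumption, and the paper's exact definition may differ. The function name `exploitability` and the rock-paper-scissors payoff matrix are illustrative only.

```python
from typing import List, Sequence


def exploitability(payoff: Sequence[Sequence[float]],
                   row_strategy: Sequence[float],
                   col_strategy: Sequence[float]) -> float:
    """Exploitability of a strategy pair in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives
    the negative. Assumed definition: (row BR value + column BR value) / 2,
    which equals NashConv / 2 because the two payoffs sum to zero.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Best pure-response value for the row player against col_strategy.
    row_br = max(
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    )
    # Best pure-response value for the column player against row_strategy.
    col_br = max(
        sum(-payoff[i][j] * row_strategy[i] for i in range(n_rows))
        for j in range(n_cols)
    )
    return (row_br + col_br) / 2.0


# Rock-paper-scissors payoffs for the row player (rows/cols: R, P, S).
RPS: List[List[float]] = [
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

nash_value = exploitability(RPS, uniform, uniform)          # equilibrium pair
exploitable_value = exploitability(RPS, always_rock, uniform)
```

Under this definition the uniform pair, which is the Nash equilibrium of rock-paper-scissors, has zero exploitability, while a row player that always plays rock can be exploited by a column player that always plays paper.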