Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response

Ariyan Bighashdel; Thiago D. Simão; Frans A. Oliehoek

TL;DR

The paper tackles the high sample cost of best-response (BR) computation in Policy Space Response Oracles (PSRO) for multi-agent settings. It introduces Joint Experience Best Response (JBR), which reuses a single joint dataset collected under the current meta-strategy to compute BRs for all agents, converting BR training into an offline RL problem. To mitigate the resulting offline bias, it proposes Conservative JBR, Exploration-Augmented JBR with $\delta$-perturbations (random and targeted), and Hybrid BR, which intermittently uses independent BR updates. For Exploration-Augmented JBR in finite two-player zero-sum games, the theory shows that if each agent computes an $\varepsilon$-best response to the perturbed profile, the meta-strategy at termination is an $(\varepsilon + 2R\delta)$-Nash equilibrium. Empirically, targeted exploration ($\delta$-T) achieves near-PSRO accuracy at a fraction of the BR sample cost, and hybrids can recover PSRO-level performance at small additional cost, across discrete and continuous multi-agent environments. Overall, JBR makes PSRO substantially more practical while preserving equilibrium robustness and non-transitive dynamics.

Abstract

Multi-agent reinforcement learning (MARL) offers a scalable alternative to exact game-theoretic analysis but suffers from non-stationarity and the need to maintain diverse populations of strategies that capture non-transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per-agent BR training makes it prohibitively expensive in many-agent or simulator-expensive settings. We introduce Joint Experience Best Response (JBR), a drop-in modification to PSRO that collects trajectories once under the current meta-strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best-response computation. Because JBR converts BR computation into an offline RL problem, we propose three remedies for distribution-shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration-Augmented JBR that perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR that interleaves JBR with periodic independent BR updates. Across benchmark multi-agent environments, Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large-scale strategic learning while preserving equilibrium robustness.
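The core idea of collecting data once and reusing it for every agent can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a one-shot rock-paper-scissors game in which each "policy" is a single action, and it replaces the offline RL step with a simple empirical-mean estimate. All names (`collect_joint_dataset`, `offline_best_response`, `payoff_fn`) are hypothetical.

```python
import random

# Rock-paper-scissors payoffs for agent 0 (agent 1 receives the negation).
# Action indices: 0 = rock, 1 = paper, 2 = scissors.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

def payoff_fn(joint):
    a0, a1 = joint
    return (RPS[a0][a1], -RPS[a0][a1])

def collect_joint_dataset(meta_strategies, n_episodes, rng):
    """Sample joint actions ONCE under the meta-strategy profile."""
    dataset = []
    for _ in range(n_episodes):
        joint = tuple(rng.choices(range(len(p)), weights=p)[0]
                      for p in meta_strategies)
        dataset.append((joint, payoff_fn(joint)))
    return dataset

def offline_best_response(dataset, agent, n_actions):
    """Greedy action for `agent` from empirical mean payoffs in the shared data."""
    totals = [0.0] * n_actions
    counts = [0] * n_actions
    for joint, payoffs in dataset:
        totals[joint[agent]] += payoffs[agent]
        counts[joint[agent]] += 1
    # Unseen actions get -inf: an offline estimate cannot rank them.
    means = [t / c if c else float("-inf") for t, c in zip(totals, counts)]
    return max(range(n_actions), key=means.__getitem__)

rng = random.Random(0)
meta = [[0.8, 0.1, 0.1],   # agent 0 mostly plays rock
        [0.1, 0.8, 0.1]]   # agent 1 mostly plays paper
data = collect_joint_dataset(meta, n_episodes=5000, rng=rng)
# Both agents' responses come from the same 5000 episodes.
responses = [offline_best_response(data, agent, 3) for agent in range(2)]
```

In this sketch, independent per-agent BR training would need a separate batch of episodes for each agent, whereas the shared dataset serves both. The `-inf` fallback for unseen actions hints at the coverage problem: actions that are rare under the meta-strategy are poorly estimated, which is the kind of offline bias the conservative, exploration-augmented, and hybrid variants are meant to address.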

Paper Structure (35 sections, 6 theorems, 17 equations, 4 figures, 2 algorithms)

Introduction
Related Work
Background
Method
Induced MDP
Independent Best Response
Joint Experience Best Response
Naïve JBR
Conservative JBR
Exploration-Augmented JBR
$\delta$-random exploration.
$\delta$-targeted exploration.
Theoretical Implication
Proof sketch.
Implication.
...and 20 more sections

Key Result

Theorem 1

Let $\sigma$ be the current meta-strategy profile and $\tilde{\sigma}$ its $\delta$-perturbed variant used for data collection. If each agent computes an $\varepsilon$-best response to $\tilde{\sigma}$, then upon termination the resulting meta-strategy is an $(\varepsilon + 2R\delta)$-Nash equilibri

Figures (4)

Figure 1: Sample-efficiency versus accuracy trade-off in Leduc Poker. Shown are total best-response episodes (in millions, $x$-axis) versus minimum NashConv after 100 iterations ($y$-axis). Standard PSRO is accurate but requires the highest BR sample cost. Joint Experience Best Response (JBR-PSRO) and its enhanced variants, namely conservative (JBR-PSRO-SPI), exploration-augmented (JBR-PSRO-$\delta$R, JBR-PSRO-$\delta$T), and hybrid (HBR-PSRO(10/30)-$\delta$T), drastically reduce the number of BR episodes needed for convergence, with JBR-PSRO-$\delta$T achieving the best trade-off and hybrid versions approaching PSRO-level accuracy.
Figure 2: Convergence of PSRO and JBR in two poker games of increasing complexity. Left: In Kuhn Poker, JBR remains close to PSRO, indicating that it is feasible when the state space is small and well covered. Right: In Leduc Poker, naïve JBR diverges from PSRO as complexity grows, revealing the offline-learning bias that motivates the JBR variants.
Figure 3: Effect of the exploration rate $\delta$ in Leduc Poker. Minimum NashConv after 100 iterations for random (JBR-PSRO-$\delta$R) and targeted (JBR-PSRO-$\delta$T) exploration. Random exploration peaks at $\delta{=}0.1$ then degrades beyond $0.4$, while targeted exploration improves up to $\delta{=}0.5$ and remains consistently better than naïve JBR.
Figure 4: Approximate NashConv in continuous multi-agent environments. Comparison of PSRO, JBR-PSRO, JBR-PSRO-$\delta$T, and two MARL baselines (IL/DDPG, CTDE/MADDPG) on Simple Tag, Simple Adversary, and Simple Push. PSRO and JBR-PSRO-$\delta$T achieve comparable approximate NashConv across games; both PSRO methods outperform IL and CTDE.
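The figures above report NashConv, which in its usual definition sums, over all players, the gain each player could obtain by switching to a best response against the others. As a rough illustration, the sketch below computes exact NashConv for a two-player matrix game. It assumes this standard definition; the paper's approximate NashConv in continuous environments would rely on learned best responses instead. The function names are hypothetical.

```python
def expected_payoff(matrix, row_strategy, col_strategy):
    """Expected payoff of a bimatrix entry table under two mixed strategies."""
    return sum(row_strategy[i] * col_strategy[j] * matrix[i][j]
               for i in range(len(row_strategy))
               for j in range(len(col_strategy)))

def nash_conv(payoff_row, payoff_col, row_strategy, col_strategy):
    """Sum over both players of (best-response value - current value)."""
    n_rows, n_cols = len(row_strategy), len(col_strategy)
    # Value of each pure deviation for the row player against the column strategy.
    row_dev = [sum(col_strategy[j] * payoff_row[i][j] for j in range(n_cols))
               for i in range(n_rows)]
    # Value of each pure deviation for the column player against the row strategy.
    col_dev = [sum(row_strategy[i] * payoff_col[i][j] for i in range(n_rows))
               for j in range(n_cols)]
    row_gain = max(row_dev) - expected_payoff(payoff_row, row_strategy, col_strategy)
    col_gain = max(col_dev) - expected_payoff(payoff_col, row_strategy, col_strategy)
    return row_gain + col_gain

# Rock-paper-scissors as a zero-sum example.
rps_row = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
rps_col = [[-x for x in row] for row in rps_row]

uniform = [1 / 3, 1 / 3, 1 / 3]
pure_rock = [1.0, 0.0, 0.0]

at_equilibrium = nash_conv(rps_row, rps_col, uniform, uniform)
both_rock = nash_conv(rps_row, rps_col, pure_rock, pure_rock)
```

NashConv is zero exactly at a Nash equilibrium (the uniform profile here) and positive otherwise, which is why lower values in the figures indicate better approximations.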

Theorems & Definitions (6)

Theorem 1: Exploration-augmented JBR
Lemma 1: Linearity in mixtures
Lemma 2: Stability under perturbations
Lemma 3: BR to perturbed $\Rightarrow$ approx-BR to true
Theorem 2: PSRO with perturbed targets
Corollary 1: Exploration-augmented JBR
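The intuition behind Lemma 3 and the $2R\delta$ term can be checked numerically on small matrix games. The sketch below rests on assumptions that are not taken from the paper: the perturbed profile is modelled as the mixture $\tilde{\sigma} = (1-\delta)\sigma + \delta\mu$ for an arbitrary strategy $\mu$, $R$ is taken to be the width of the payoff range, and the best response is exact ($\varepsilon = 0$). Under these assumptions each payoff shifts by at most $R\delta$ when $\sigma$ is replaced by $\tilde{\sigma}$, so a best response to $\tilde{\sigma}$ loses at most $2R\delta$ against $\sigma$. The paper's own definitions of $R$ and of the perturbation may differ.

```python
import random

def value(matrix, action, opp_strategy):
    """Expected payoff of a pure action against a mixed opponent strategy."""
    return sum(p * matrix[action][j] for j, p in enumerate(opp_strategy))

def br_gap_after_perturbation(matrix, sigma, mu, delta):
    """Gap, measured against sigma, of an exact BR to the perturbed sigma."""
    # Assumed perturbation model: mixture of sigma with an arbitrary mu.
    sigma_tilde = [(1 - delta) * s + delta * m for s, m in zip(sigma, mu)]
    n = len(matrix)
    br_tilde = max(range(n), key=lambda a: value(matrix, a, sigma_tilde))
    best_true = max(value(matrix, a, sigma) for a in range(n))
    return best_true - value(matrix, br_tilde, sigma)

def random_simplex(k, rng):
    """A random probability vector of length k."""
    w = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
R, delta = 2.0, 0.1          # payoffs drawn from [-1, 1], so the range width is 2
worst = 0.0
for _ in range(1000):
    matrix = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(4)]
    gap = br_gap_after_perturbation(
        matrix, random_simplex(4, rng), random_simplex(4, rng), delta)
    worst = max(worst, gap)
```

Across the sampled games, `worst` stays below `2 * R * delta`. With an $\varepsilon$-approximate best response the same argument would add $\varepsilon$ to the gap, matching the form of the $(\varepsilon + 2R\delta)$ bound stated in the key result.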
