Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response
Ariyan Bighashdel, Thiago D. Simão, Frans A. Oliehoek
TL;DR
The paper tackles the high sample cost of best-response (BR) computation in Policy Space Response Oracles (PSRO) for multi-agent settings. It introduces Joint Experience Best Response (JBR), which reuses a single joint dataset collected under the current meta-strategy to compute BRs for all agents, converting BR training into an offline RL problem. To mitigate the resulting offline bias, it proposes three variants: Conservative JBR; Exploration-Augmented JBR with $\delta$-perturbations (random and targeted); and Hybrid BR, which intermittently uses independent BR updates. Theoretical results show convergence guarantees with exploration bounded by $\varepsilon + 2R\delta$ in finite two-player zero-sum games. Empirically, targeted exploration ($\delta$-T) achieves near-PSRO accuracy at a fraction of the BR sample cost, and hybrids can recover PSRO-level performance at a small additional cost, across discrete and continuous multi-agent environments. Overall, JBR substantially enhances PSRO practicality while preserving equilibrium robustness and non-transitive dynamics.
Abstract
Multi-agent reinforcement learning (MARL) offers a scalable alternative to exact game-theoretic analysis but suffers from non-stationarity and the need to maintain diverse populations of strategies that capture non-transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per-agent BR training makes it prohibitively expensive in many-agent or simulator-expensive settings. We introduce Joint Experience Best Response (JBR), a drop-in modification to PSRO that collects trajectories once under the current meta-strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best-response computation. Because JBR converts BR computation into an offline RL problem, we propose three remedies for distribution-shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration-Augmented JBR that perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR that interleaves JBR with periodic independent BR updates. Across benchmark multi-agent environments, Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large-scale strategic learning while preserving equilibrium robustness.
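To make the data-reuse idea concrete, the following is a minimal sketch on a two-player zero-sum matrix game (rock-paper-scissors), not the authors' implementation. It assumes that each player deviates to a uniformly random action with probability delta during data collection, and it uses a per-action empirical mean as a stand-in for an offline RL method. The names `perturb`, `collect_joint_data`, and `offline_best_response` are hypothetical.

```python
import random

def perturb(sigma, delta):
    """Mix a strategy with the uniform distribution (assumed delta-random exploration)."""
    n = len(sigma)
    return [(1 - delta) * p + delta / n for p in sigma]

def collect_joint_data(payoff, sigma_row, sigma_col, n_samples, delta, rng):
    """Sample joint actions once; the same dataset serves both players."""
    rows = rng.choices(range(len(sigma_row)), weights=perturb(sigma_row, delta), k=n_samples)
    cols = rng.choices(range(len(sigma_col)), weights=perturb(sigma_col, delta), k=n_samples)
    # Each record is (row action, column action, row player's reward).
    return [(a, b, payoff[a][b]) for a, b in zip(rows, cols)]

def offline_best_response(samples, n_actions):
    """Return the action with the highest mean reward among (action, reward) pairs."""
    totals = [0.0] * n_actions
    counts = [0] * n_actions
    for action, reward in samples:
        totals[action] += reward
        counts[action] += 1
    # Actions never observed in the dataset cannot be evaluated offline.
    means = [t / c if c else float("-inf") for t, c in zip(totals, counts)]
    return max(range(n_actions), key=means.__getitem__), means

# Row player's payoffs; actions are 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

rng = random.Random(0)
# Meta-strategy (illustrative): row always plays scissors, column always plays rock.
data = collect_joint_data(PAYOFF, [0, 0, 1], [1, 0, 0], 5000, 0.3, rng)

# Both best responses are computed from the single joint dataset.
br_row, _ = offline_best_response([(a, r) for a, _, r in data], 3)
br_col, _ = offline_best_response([(b, -r) for _, b, r in data], 3)
print(br_row, br_col)
```

With these settings the sketch should select paper for the row player and rock for the column player. Setting `delta` to 0 here would leave rock and paper unobserved for the row player, so their values could not be estimated from the dataset; this illustrates the coverage problem that motivates perturbing data collection.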
