Offline Fictitious Self-Play for Competitive Games

Jingxiao Chen, Weiji Xie, Weinan Zhang, Yong Yu, Ying Wen

TL;DR

This work tackles offline multi-agent reinforcement learning in competitive, zero-sum extensive-form games by introducing Offline Self-Play (Off-SP) and Offline Fictitious Self-Play (Off-FSP). It reweights a fixed dataset with importance sampling to simulate play against diverse opponents and learn best responses (BRs) to them, then combines single-agent offline RL with Fictitious Self-Play (FSP) to approximate Nash equilibria under partial coverage while limiting extrapolation to out-of-distribution (OOD) actions. Across extensive-form games (e.g., Leduc Poker, Large Kuhn Poker, Oshi Zumo) and a real-world human-robot ball-defence task, Off-FSP achieves significantly lower exploitability (NashConv) than baselines and demonstrates practical viability in non-simulated settings. The approach is modular: it can be paired with various offline RL algorithms (e.g., CQL, CRR, BCQ), offering a scalable path to robust decision-making in real-world competitive environments where simulators are unavailable or costly.

Abstract

Offline Reinforcement Learning (RL) enables policy improvement from fixed datasets without online interactions, making it highly suitable for real-world applications lacking efficient simulators. Despite its success in the single-agent setting, offline multi-agent RL remains a challenge, especially in competitive games. First, without access to the game structure, the learner cannot interact with opponents and therefore cannot use self-play, a major learning paradigm for competitive games. Second, real-world datasets cannot cover the full state and action space of the game, which hinders identifying a Nash equilibrium (NE). To address these issues, this paper introduces Off-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn the best responses to different opponents and employ the Offline Self-Play learning framework. To overcome the challenge of partial coverage, we combine the single-agent offline RL method with Fictitious Self-Play (FSP) to approximate NE by constraining the approximate best responses away from out-of-distribution actions. Experiments on matrix games, extensive-form poker, and board games demonstrate that Off-FSP achieves significantly lower exploitability than state-of-the-art baselines. Finally, we validate Off-FSP on a real-world human-robot competitive task, demonstrating its potential for solving complex, hard-to-simulate real-world problems.
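
To make the terms "fictitious self-play" and "exploitability" concrete, the sketch below runs classic fictitious play on Rock-Paper-Scissors and measures NashConv of the averaged strategies. It is an illustration under stated assumptions, not the paper's algorithm: it uses exact best responses computed from a known payoff matrix, whereas the offline method has no such access and learns approximate best responses from a fixed dataset. The names `PAYOFF`, `nash_conv`, and `fictitious_play` are hypothetical helpers introduced here.

```python
# Row player's payoff in Rock-Paper-Scissors (zero-sum); order is R, P, S.
PAYOFF = [
    [0, -1, 1],   # Rock
    [1, 0, -1],   # Paper
    [-1, 1, 0],   # Scissors
]


def row_values(col_strategy):
    """Expected row payoff of each pure row action against col_strategy."""
    return [sum(PAYOFF[a][b] * col_strategy[b] for b in range(3)) for a in range(3)]


def col_values(row_strategy):
    """Expected column payoff (negated row payoff) of each pure column action."""
    return [sum(-PAYOFF[a][b] * row_strategy[a] for a in range(3)) for b in range(3)]


def nash_conv(row_strategy, col_strategy):
    """Sum over both players of the gain from switching to a best response."""
    value = sum(
        PAYOFF[a][b] * row_strategy[a] * col_strategy[b]
        for a in range(3)
        for b in range(3)
    )
    row_gain = max(row_values(col_strategy)) - value
    col_gain = max(col_values(row_strategy)) + value  # column's current value is -value
    return row_gain + col_gain


def fictitious_play(iterations):
    """Each player repeatedly best-responds to the opponent's average strategy."""
    row_counts = [1.0, 0.0, 0.0]  # assumption: both players start with Rock
    col_counts = [1.0, 0.0, 0.0]
    for _ in range(iterations):
        row_avg = [c / sum(row_counts) for c in row_counts]
        col_avg = [c / sum(col_counts) for c in col_counts]
        rv = row_values(col_avg)
        cv = col_values(row_avg)
        # Add the exact best response to each player's action counts.
        row_counts[rv.index(max(rv))] += 1.0
        col_counts[cv.index(max(cv))] += 1.0
    row_avg = [c / sum(row_counts) for c in row_counts]
    col_avg = [c / sum(col_counts) for c in col_counts]
    return row_avg, col_avg


if __name__ == "__main__":
    row, col = fictitious_play(2000)
    print("average strategies:", row, col)
    print("NashConv:", nash_conv(row, col))
```

NashConv is zero exactly at a Nash equilibrium (here, uniform play by both players), so a lower value means the strategy pair is harder to exploit.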

Paper Structure

This paper contains 37 sections, 1 theorem, 11 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Theorem 4.3

For player $i$, the weight of transferring the opponent from $\pi^{-i}_b$ to $\pi^{-i}$ is:
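
The formula itself is not shown in this summary. As an illustration only, the sketch below computes a generic trajectory-level importance weight: the product, over the opponent's logged decisions, of the ratio between the target opponent policy $\pi^{-i}$ and the behaviour policy $\pi^{-i}_b$. This generic form is an assumption about the overall shape of such a weight and may differ from the theorem's exact statement. The names `opponent_weight`, `uniform_rps`, and `rock_heavy` are hypothetical.

```python
from typing import Callable, Hashable, Sequence, Tuple

# A policy maps (observation, action) to the probability of taking that action.
Policy = Callable[[Hashable, Hashable], float]
# One opponent decision recorded in the dataset: (observation, action).
Step = Tuple[Hashable, Hashable]


def opponent_weight(opp_steps: Sequence[Step], target: Policy, behaviour: Policy) -> float:
    """Product of per-step probability ratios target / behaviour.

    Only the opponent's decisions are passed in. Assumption: the learner's
    own actions and chance events are unchanged, so they contribute a ratio of 1.
    """
    weight = 1.0
    for obs, action in opp_steps:
        p_b = behaviour(obs, action)
        if p_b <= 0.0:
            raise ValueError("behaviour policy gives zero probability to a logged action")
        weight *= target(obs, action) / p_b
    return weight


def uniform_rps(obs: Hashable, action: Hashable) -> float:
    """Behaviour opponent: plays R, P, S uniformly at random."""
    return 1.0 / 3.0


def rock_heavy(obs: Hashable, action: Hashable) -> float:
    """Target opponent we want to simulate: favours Rock."""
    return {"R": 0.5, "P": 0.25, "S": 0.25}[action]


# Two logged opponent decisions from a hypothetical repeated RPS dataset.
trajectory = [("round1", "R"), ("round2", "P")]
w = opponent_weight(trajectory, target=rock_heavy, behaviour=uniform_rps)
print(w)
```

Samples weighted this way can be fed to a single-agent offline RL learner as if they had been collected against the target opponent. The weight is only defined where the behaviour policy assigns non-zero probability to the logged actions, which is why the sketch raises an error otherwise.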

Figures (11)

  • Figure 1: Comparison of offline self-play with other learning paradigms for real-world competitive tasks. Online self-play requires a simulator and suffers from the sim-to-real gap. Behavior cloning needs costly expert data. Naive offline RL optimizes with incorrect objectives. In contrast, Off-SP learns from inexpensive, low-quality data. See the real-world experiments section for experiments on the illustrated real-world game.
  • Figure 2: Example Datasets of RPS. Numbers in the grids show the probability density of different samples. The red dashed boxes indicate the probability of different actions for corresponding behavioural policies.
  • Figure 3: Results of RPS. The prefixes D1- and D2- refer to results on the first and second datasets, respectively.
  • Figure 4: Illustration of Off-SP, Off-FSP and three essential steps. The green box in (b) is Off-SP.
  • Figure 5: An illustrative example of trajectories $\tau^1$ and $\tau_E$. Purple parts represent Player 1. The yellow arrows indicate the projection relationship under function $\mathcal{F}^1$. The green part indicates $\tau_{<}^1$.
  • ...and 6 more figures

Theorems & Definitions (4)

  • Definition 4.1: Fully Covered Dataset
  • Definition 4.2: Real-Equivalence Dataset
  • Theorem 4.3
  • proof