DiffFP: Learning Behaviors from Scratch via Diffusion-based Fictitious Play

Akash Karthikeyan, Yash Vardhan Pant

TL;DR

The paper tackles learning robust strategies in dynamic, continuous-action multi-agent settings, where non-stationarity and vulnerability to unseen opponents are the main obstacles. It introduces DiffFP, a diffusion-based fictitious-play framework in which a diffusion policy models the best response to an evolving average opponent strategy, enabling multimodal and robust behavior learned from scratch. The approach empirically converges toward an $\varepsilon$-Nash equilibrium in continuous zero-sum games and reports faster convergence (up to $3\times$) and higher success rates (up to $30\times$) than RL baselines across racing and multi-particle environments. These results suggest that diffusion-based best responses yield robust, generalizable policies for competitive multi-agent tasks with continuous actions.

Abstract

Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging. Ensuring adaptability and generalization in self-play settings is critical for achieving competitive performance in dynamic multi-agent environments. These challenges often cause methods to converge slowly or fail to converge at all to a Nash equilibrium, making agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $\varepsilon$-Nash equilibria in continuous-space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to $3\times$ faster convergence and $30\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations.
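The fictitious-play loop that the abstract builds on can be illustrated with a toy example. The sketch below is not DiffFP: it assumes a discrete rock-paper-scissors matrix game and replaces the diffusion-policy best response with an exact `argmax`, so it only shows the classical idea of best-responding to the opponent's empirical average strategy. The names `A`, `exploitability`, and `fictitious_play` are hypothetical, and exploitability is taken here as the sum of both players' best-response gains, which may be normalized differently in the paper.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors.
# Zero-sum assumption: the column player receives -A.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])


def exploitability(x, y, A):
    """Sum of both players' best-response gains against the profile (x, y)."""
    value = x @ A @ y
    row_gain = np.max(A @ y) - value   # row player maximizes x^T A y
    col_gain = value - np.min(x @ A)   # column player minimizes x^T A y
    return row_gain + col_gain


def fictitious_play(A, iterations=2000):
    """Classical fictitious play with exact best responses (toy stand-in)."""
    n, m = A.shape
    row_counts = np.zeros(n)
    col_counts = np.zeros(m)
    row_counts[0] = 1.0  # arbitrary initial pure actions
    col_counts[1] = 1.0
    for _ in range(iterations):
        # Empirical average strategies played so far.
        x_avg = row_counts / row_counts.sum()
        y_avg = col_counts / col_counts.sum()
        # Each player best-responds to the opponent's average strategy.
        br_row = np.argmax(A @ y_avg)
        br_col = np.argmin(x_avg @ A)
        row_counts[br_row] += 1.0
        col_counts[br_col] += 1.0
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


if __name__ == "__main__":
    x, y = fictitious_play(A)
    print("average strategies:", x, y)
    print("exploitability:", exploitability(x, y, A))
```

In this toy setting the average strategies drift toward the uniform mixture and the exploitability shrinks with more iterations. In a continuous-action game the exact `argmax` is unavailable, which is the gap the abstract says a learned best-response policy is meant to fill.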

Paper Structure

This paper contains 27 sections, 13 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: (A–B) Failure Modes of Baselines. Baseline agents exhibit suboptimal behaviors such as stalling or inefficient path planning. (C–E) Training Progression of DiffFP. Agent trajectories sampled from successive FP iterations illustrate the emergence of strategic behaviors. Over time, agents learn to navigate more efficiently, reducing their time-to-goal from 80 to 45 steps under identical initial conditions.
  • Figure 2: Exploitability. We report exploitability (see Def. 3) computed over 100 evaluation runs. Lower values indicate better robustness, with the proposed DiffFP achieving the lowest exploitability.
  • Figure 3: A. Exploitability on the Racing Task. Mean and standard deviation of exploitability over 10 episodes per iteration. B. Normalized Episodic Rewards. This metric reflects training performance and stability.
  • Figure 4: A. MPE-Tag Env., B. MPE-Adversary Env. Dashed lines indicate possible agent trajectories. C. Q-value Map. At convergence to a near-Nash equilibrium, ego agents exhibit coordinated behaviors with diverse roles, e.g., one agent advances toward the goal while the other acts as a decoy to distract or mislead the adversaries.

Theorems & Definitions (5)

  • Remark 1: Two-player zero-sum game
  • Definition 1: Best Response
  • Definition 2: $\varepsilon$-Nash Equilibrium
  • Definition 3: Exploitability
  • Remark 2: CTDE for Multi-Agent Teams
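
For orientation, the three definitions listed above are commonly stated as follows for a two-player zero-sum game. This is the standard textbook form, written under assumed notation (policies $\pi_i$, opponent policy $\pi_{-i}$, expected utility $u_i$ with $u_1 = -u_2$). The paper's own statements, symbols, and normalization of exploitability may differ.

```latex
% Best response of player i to a fixed opponent policy \pi_{-i}
\mathrm{BR}_i(\pi_{-i}) \in \arg\max_{\pi_i'} \; u_i(\pi_i', \pi_{-i})

% \varepsilon-Nash equilibrium: no player gains more than \varepsilon by deviating
u_i(\pi_i, \pi_{-i}) \;\ge\; \max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - \varepsilon
\qquad \text{for all } i \in \{1, 2\}

% Exploitability of a profile \pi = (\pi_1, \pi_2): total best-response gain
\mathrm{Expl}(\pi) \;=\; \sum_{i \in \{1,2\}}
\Big[ \max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - u_i(\pi_i, \pi_{-i}) \Big]
```

Under these conventions exploitability is non-negative, it is zero exactly at a Nash equilibrium, and a profile with exploitability at most $\varepsilon$ is an $\varepsilon$-Nash equilibrium.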