
Decentralized Multi-Agent Reinforcement Learning for Continuous-Space Stochastic Games

Awni Altabaa, Bora Yongacoglu, Serdar Yüksel

TL;DR

The paper addresses decentralized MARL in stochastic games with general state spaces by extending decentralized Q-learning to continuous spaces through state-space quantization and a two-time-scale learning scheme. It proves that, under weak continuity and bounded costs, agents achieve near-optimal policy updates with respect to their observed environments, and it characterizes global policy-updating dynamics as an absorbing Markov chain with a closed-form expression for equilibrium convergence probabilities. By analyzing both idealized updating dynamics and their quantized approximations, the work provides conditions under which self-play converges to (near-)equilibria and discusses limitations in achieving global team-optimality. A simulation study on a two-player stochastic team corroborates the theory, illustrating convergence behavior and the impact of quantization and exploration on attaining team-optimal policies.
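The two ingredients named above, state-space quantization and a two-time-scale scheme, can be illustrated with a minimal single-agent sketch. Everything here is an assumption made for illustration and not the paper's algorithm: the one-dimensional state in [0, 1], the uniform quantizer, the toy dynamics and cost, the greedy end-of-phase policy update, and the names `quantize`, `q_update`, `greedy_policy`, and `run_phase`. The point is only the structure: Q-factors are updated at every step on the quantized state, while the baseline policy is held fixed within an exploration phase and revised once the phase ends.

```python
import random


def quantize(x, low, high, n_bins):
    """Map a scalar state x to a uniform bin index in {0, ..., n_bins - 1}."""
    x = min(max(x, low), high)              # clip into [low, high]
    width = (high - low) / n_bins
    return min(int((x - low) / width), n_bins - 1)


def q_update(q, s_bin, action, cost, next_bin, alpha, beta):
    """One cost-minimizing Q-learning step on quantized states (fast time scale)."""
    target = cost + beta * min(q[next_bin])
    q[s_bin][action] += alpha * (target - q[s_bin][action])


def greedy_policy(q):
    """For each bin, pick the action with the smallest learned Q-factor."""
    return [min(range(len(row)), key=row.__getitem__) for row in q]


def run_phase(q, policy, T, rho, rng, n_bins=4, alpha=0.1, beta=0.9):
    """Run one exploration phase of length T with a fixed baseline policy.

    Hypothetical toy environment: action 1 pushes the state right, action 0
    pushes it left, and the stage cost is the current state itself.
    """
    x = rng.random()
    for _ in range(T):
        s = quantize(x, 0.0, 1.0, n_bins)
        # Perturbed policy: with probability rho, explore uniformly at random.
        a = rng.randrange(2) if rng.random() < rho else policy[s]
        cost = x
        drift = 0.1 if a == 1 else -0.1
        x_next = min(max(x + drift + rng.uniform(-0.05, 0.05), 0.0), 1.0)
        q_update(q, s, a, cost, quantize(x_next, 0.0, 1.0, n_bins), alpha, beta)
        x = x_next
    # Slow time scale: the policy changes only after the phase is over.
    return greedy_policy(q)


rng = random.Random(0)
q_table = [[0.0, 0.0] for _ in range(4)]
policy = [0, 0, 0, 0]
for k in range(5):
    policy = run_phase(q_table, policy, T=500, rho=0.2, rng=rng)
print(policy)
```

In a multi-agent setting each agent would run its own copy of this loop, seeing only the state and its own cost, with the other agents' fixed baseline policies absorbed into the environment it observes during the phase.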

Abstract

Stochastic games are a popular framework for studying multi-agent reinforcement learning (MARL). Recent advances in MARL have focused primarily on games with finitely many states. In this work, we study multi-agent learning in stochastic games with general state spaces and an information structure in which agents do not observe each other's actions. In this context, we propose a decentralized MARL algorithm and we prove the near-optimality of its policy updates. Furthermore, we study the global policy-updating dynamics for a general class of best-reply based algorithms and derive a closed-form characterization of convergence probabilities over the joint policy space.
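To make "convergence probabilities" concrete, the sketch below computes absorption probabilities for a generic absorbing Markov chain using the standard textbook formula $B = (I - Q)^{-1} R$, where $Q$ holds transient-to-transient transition probabilities and $R$ holds transient-to-absorbing ones. Reading transient states as non-equilibrium joint policies and absorbing states as equilibria is an interpretation of the summary above; the paper's own closed-form expression may be stated differently. The function name `absorption_probabilities` and the two-by-two example chain are hypothetical.

```python
from fractions import Fraction


def absorption_probabilities(Q, R):
    """Solve (I - Q) B = R exactly by Gauss-Jordan elimination.

    Q[i][j]: probability of moving from transient state i to transient state j.
    R[i][j]: probability of moving from transient state i to absorbing state j.
    Returns B, where B[i][j] is the probability that the chain started in
    transient state i is eventually absorbed in absorbing state j.
    """
    n, m = len(Q), len(R[0])
    # Build the augmented matrix [I - Q | R] with exact rational entries.
    aug = [
        [Fraction(int(i == j)) - Fraction(Q[i][j]) for j in range(n)]
        + [Fraction(R[i][j]) for j in range(m)]
        for i in range(n)
    ]
    for col in range(n):
        # I - Q is invertible for an absorbing chain, so a pivot exists.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical chain: two transient states and two absorbing states.
half = Fraction(1, 2)
Q = [[0, half], [half, 0]]
R = [[half, 0], [0, half]]
B = absorption_probabilities(Q, R)
print(B)
```

Each row of `B` sums to one, reflecting that the chain is absorbed with probability one; exact rationals are used so the result is not blurred by floating-point error.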

Paper Structure

This paper contains 14 sections, 5 theorems, 21 equations, 1 figure, and 2 algorithms.

Key Result

Theorem 4.1

Suppose all players use the decentralized Q-learning algorithm (alg:cts_dec_qlearning) to select their actions. For any $\epsilon > 0$, there exists $\tilde{T}$ such that $T_k \geq \tilde{T}$ implies [...], where $\boldsymbol{\pi}_k$ is the baseline joint policy during the $k^{\rm th}$ exploration phase and $\boldsymbol{\pi}_{k, \rho}$ is the perturbation of $\boldsymbol{\pi}_k$ that is used for action selection. Furthermore, for any [...]

Figures (1)

  • Figure 1: Simulation results: proportion of 50 trials where the policy at the $k$th exploration phase was optimal

Theorems & Definitions (15)

  • Definition 3.1
  • Definition 3.2
  • Theorem 4.1
  • proof
  • Remark
  • Proposition 5.1
  • proof
  • Proposition 5.2
  • proof
  • Remark
  • ...and 5 more