
Decentralized MARL for Coarse Correlated Equilibrium in Aggregative Markov Games

Siying Huang, Yifen Mu, Ge Chen

Abstract

This paper studies the problem of decentralized learning of Coarse Correlated Equilibrium (CCE) in aggregative Markov games (AMGs), where each agent's instantaneous reward depends only on its own action and an aggregate quantity. Existing CCE learning algorithms for general Markov games are not designed to leverage the aggregative structure, and research on decentralized CCE learning for AMGs remains limited. We propose an adaptive stage-based V-learning algorithm that exploits the aggregative structure under a fully decentralized information setting. Based on the two-timescale idea, the algorithm partitions learning into stages and adjusts stage lengths based on the variability of aggregate signals, while using no-regret updates within each stage. We prove that the algorithm achieves an $\epsilon$-approximate CCE in $O(S A_{\max} T^5 / \epsilon^2)$ episodes, avoiding the curse of multi-agents that commonly arises in MARL. Numerical results verify the theoretical findings, and the decentralized, model-free design enables easy extension to large-scale multi-agent scenarios.
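The abstract describes two interacting ingredients: no-regret (e.g., multiplicative-weights) policy updates run by each agent within a stage, and adaptive stage boundaries triggered by drift in the aggregate signal. The toy sketch below illustrates that mechanism on a single-state aggregative game; the congestion-style payoff, the drift tolerance, and all function names are illustrative assumptions, not the paper's Algorithm alg:sbv.

```python
import numpy as np

def hedge_update(weights, losses, lr):
    """Multiplicative-weights (Hedge) no-regret update on one agent's policy."""
    w = weights * np.exp(-lr * losses)
    return w / w.sum()

def run_stage_based_learning(num_agents=4, num_actions=3, episodes=2000,
                             drift_tol=0.05, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Decentralized: each agent keeps only its own action distribution.
    policies = np.full((num_agents, num_actions), 1.0 / num_actions)
    action_vals = np.linspace(0.0, 1.0, num_actions)
    stage_aggregate = None          # aggregate observed at the stage's start
    stage_lengths, current_len = [], 0
    for _ in range(episodes):
        actions = np.array([rng.choice(num_actions, p=policies[i])
                            for i in range(num_agents)])
        aggregate = action_vals[actions].mean()
        # Aggregative structure: each agent's loss depends only on its own
        # action and the aggregate (toy payoff: match the aggregate).
        losses = np.abs(action_vals - aggregate)
        for i in range(num_agents):
            policies[i] = hedge_update(policies[i], losses, lr)
        current_len += 1
        # Adaptive stage boundary: close the stage once the aggregate
        # signal drifts beyond the tolerance, then restart the clock.
        if stage_aggregate is None:
            stage_aggregate = aggregate
        elif abs(aggregate - stage_aggregate) > drift_tol:
            stage_lengths.append(current_len)
            current_len, stage_aggregate = 0, aggregate
    if current_len:
        stage_lengths.append(current_len)
    return policies, stage_lengths
```

In this sketch the stage lengths grow as the aggregate stabilizes, mirroring the two-timescale idea: fast no-regret updates inside a stage, slow value refreshes at stage boundaries.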

Paper Structure

This paper contains 9 sections, 6 theorems, 48 equations, 4 figures, 5 algorithms.

Key Result

Theorem 3.1

(Sample complexity of learning CCE). For any $p\in (0,1]$, set $\iota = \log(2NSA_{\max} KT/p)$, and let the agents run Algorithm alg:sbv for $K$ episodes with $K= O(S A_{\max}T^5 \iota/\epsilon^2)$. Then, with probability at least $1-p$, the output policy $\bar{\pi}$ of Algorithm alg:certify is an $\epsilon$-approximate CCE.

Figures (4)

  • Figure 1: Illustration of the adaptive stage-based mechanism.
  • Figure 2: Individual cumulative reward of two agents.
  • Figure 3: Average reward of two agents.
  • Figure 4: Reward comparison of three algorithms on the Fishing Game. "Centralized Q" denotes a centralized oracle that controls both agents' actions to maximize their joint reward. "Independent Q" means each agent runs a naive single-agent Q-learning algorithm independently, taking greedy actions based only on local information without considering the other agent.

Theorems & Definitions (7)

  • Definition 2.1
  • Theorem 3.1
  • Lemma 3.2: Estimation and regret bound
  • Lemma 3.3: Confidence bounds
  • Lemma A.1
  • Lemma A.2
  • Theorem A.3