Structured Cooperative Multi-Agent Reinforcement Learning: a Bayesian Network Perspective

Shahbaz P Qadri Syed, He Bai

TL;DR

This work tackles scalable cooperative MARL under partial observability by exploiting known inter-agent couplings via a Multi-agent Bayesian Network (MABN) to enable exact Q-function decomposition. It develops a partially decentralized training and decentralized execution (P-DTDE) framework and a MAStAC actor-critic algorithm, with a gradient decomposition that uses only the value dependency set $\mathcal{I}^i_Q(t)$ and an aggregated $\widehat{Q}_i$. Theoretical results show variance reduction of P-DTDE relative to CTDE under unbiased estimates and provide a policy-gradient theorem for both deterministic and stochastic policies, plus a $\kappa$-approximation to scale to dense couplings. Empirical evaluations on warehouse resource allocation and multi-zone temperature control demonstrate faster convergence, lower variance, and improved sample efficiency over standard CTDE baselines.

Abstract

The empirical success of multi-agent reinforcement learning (MARL) has motivated the search for more efficient and scalable algorithms for large-scale multi-agent systems. However, existing state-of-the-art MARL algorithms do not fully exploit inter-agent coupling information. In this paper, we propose a systematic approach to leverage structures in the inter-agent couplings for efficient model-free reinforcement learning. We model the cooperative MARL problem via a Bayesian network and characterize the subset of agents, termed the value dependency set, whose information is required by each agent to estimate its local action value function exactly. Moreover, we propose a partially decentralized training decentralized execution (P-DTDE) paradigm based on the value dependency set. We theoretically establish that the total variance of our P-DTDE policy gradient estimator is less than that of the centralized training decentralized execution (CTDE) policy gradient estimator. We derive a multi-agent policy gradient theorem based on the P-DTDE scheme and develop a scalable actor-critic algorithm. We demonstrate the efficiency and scalability of the proposed algorithm on multi-warehouse resource allocation and multi-zone temperature control examples. For dense value dependency sets, we propose an approximation scheme based on truncation of the Bayesian network and empirically show that it achieves faster convergence than the exact value dependency set for applications with a large number of agents.

Paper Structure

This paper contains 21 sections, 4 theorems, 52 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.2

For any $i\in \mathcal{V}$, $Q^\pi_i(s(t),a(t))$ depends only on $(s_{\mathcal{I}^i_Q(t)}(t), a_{\mathcal{I}^i_Q(t)}(t))$.
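As a rough illustration of what Theorem 3.2 implies for a local critic, the sketch below builds the critic input for one agent from only the states and actions of the agents in its value dependency set at a single time step. The function name `restrict_to_dependency_set`, the toy dimensions, and the set `{1, 2, 3}` are hypothetical and chosen for illustration only; they are not taken from the paper, and the sketch does not show how $\mathcal{I}^i_Q(t)$ is derived from the MABN.

```python
from typing import List, Sequence, Set


def restrict_to_dependency_set(
    states: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
    dep_set: Set[int],
) -> List[float]:
    """Concatenate the states, then the actions, of the agents in dep_set.

    dep_set stands in for the value dependency set of one agent at one
    time step; it is assumed to be given (hypothetical input).
    """
    critic_input: List[float] = []
    for j in sorted(dep_set):
        critic_input.extend(states[j])
    for j in sorted(dep_set):
        critic_input.extend(actions[j])
    return critic_input


# Toy system (assumption): 5 agents, 2-dimensional states, 1-dimensional actions.
states = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
actions = [[0.0], [1.0], [2.0], [3.0], [4.0]]

# Hypothetical value dependency set for one agent.
dep_set = {1, 2, 3}

critic_input = restrict_to_dependency_set(states, actions, dep_set)
global_dim = sum(len(s) for s in states) + sum(len(a) for a in actions)
print(len(critic_input), "of", global_dim, "global dimensions used")
```

In this toy setting the local critic sees 9 of the 15 global state-action dimensions; when the value dependency set is sparse relative to the full agent set, the saving grows with the number of agents.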

Figures (11)

  • Figure 1: Comparison of the total average reward over 15 MC simulations. From left to right: Examples 1 to 3.
  • Figure 2: Ablation study of the value dependency set in the 9-warehouse example.
  • Figure 3: The state graph $\mathcal{G}_S$ of a MAS (left) and the edges corresponding to agents $2,3,5$ in the MABN for $t=0,1$ (right).
  • Figure 4: The observation graph $\mathcal{G}_O$ of a MAS (left) and the edges corresponding to agents $1,2,3,4,5,6$ in the MABN (right).
  • Figure 5: The reward graph $\mathcal{G}_R$ of a MAS (left) and the edges corresponding to agents $1,2,3,4,5,6$ in the MABN (right).
  • ...and 6 more figures

Theorems & Definitions (14)

  • Remark 3.1
  • Theorem 3.2
  • Remark 3.4
  • Remark 3.5
  • Theorem 3.6: Gradient decomposition theorem
  • proof
  • Theorem 3.7
  • Theorem 4.2
  • proof
  • Definition 4.3: $\kappa$-approximated value dependency graph
  • ...and 4 more