Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes

Lakshmi Mandal, Chandrashekar Lakshminarayanan, Shalabh Bhatnagar

TL;DR

This work addresses the computational bottleneck of policy iteration in cooperative multi-agent MDPs with exponentially large joint action spaces by introducing approximate decentralized policy iteration (ADPI) based on approximate linear programming (ALP). The proposed finite-horizon and infinite-horizon algorithms compute approximate value functions via ALP and perform decentralized policy improvements, with theoretical cost-improvement guarantees. Empirical results on standard cooperative tasks show that ADPI with ALP converges faster than existing exact-value or fully decentralized approaches and performs competitively or better, while substantially reducing dimensionality. The contributions demonstrate scalable, provable, and practically effective decentralized planning for large multi-agent systems.
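
For orientation, the textbook form of an approximate linear program for a discounted MDP is sketched below. It is written for reward maximization, and the notation is an assumption made here rather than taken from the paper: $\Phi$ is a feature matrix, $w$ the weight vector being optimized, $c$ a vector of positive state-relevance weights, $R(s,a)$ the shared reward, $P(s' \mid s,a)$ the transition kernel, and $\gamma$ the discount factor. The formulation used in the paper's algorithms may differ.

```latex
\begin{aligned}
\min_{w}\quad & \sum_{s} c(s)\,(\Phi w)(s) \\
\text{s.t.}\quad & (\Phi w)(s) \;\ge\; R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,(\Phi w)(s')
\qquad \text{for all } s,\ a .
\end{aligned}
```

The number of variables equals the number of features (columns of $\Phi$) rather than the number of states, which is the usual source of dimensionality reduction in ALP.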

Abstract

In this work, we consider a cooperative multi-agent Markov decision process (MDP) involving m agents. At each decision epoch, all m agents independently select actions in order to maximize a common long-term objective. In policy iteration for the multi-agent setup, the number of joint actions grows exponentially with the number of agents, incurring huge computational costs. Thus, recent works consider decentralized policy improvement, where each agent improves its decisions unilaterally, assuming that the decisions of the other agents are fixed. However, the literature considers exact value functions, which are computationally expensive to obtain for a large number of agents with a high-dimensional state-action space. Thus, we propose approximate decentralized policy iteration algorithms that use approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. Further, we consider both finite-horizon and infinite-horizon discounted cooperative multi-agent MDPs and propose suitable algorithms in each case. Moreover, we provide theoretical guarantees for our algorithms and demonstrate their advantages over existing state-of-the-art algorithms in the literature.
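
The decentralized improvement step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the random MDP from `make_mdp`, the fixed agent ordering, and the helper names are hypothetical, and the value function comes from plain iterative policy evaluation, standing in for the ALP-based approximation the paper proposes. It shows only that each agent searches over its own k actions while the other agents' actions stay fixed, so a state needs m·k candidate evaluations instead of k^m.

```python
import itertools
import random


def make_mdp(n_states, m, k, seed=0):
    """Hypothetical random cooperative MDP; joint actions are m-tuples."""
    rng = random.Random(seed)
    joint = list(itertools.product(range(k), repeat=m))
    reward, trans = {}, {}
    for s in range(n_states):
        for a in joint:
            reward[(s, a)] = rng.random()  # shared (common) reward
            w = [rng.random() for _ in range(n_states)]
            z = sum(w)
            trans[(s, a)] = [x / z for x in w]  # next-state distribution
    return reward, trans


def q_value(s, a, value, reward, trans, gamma):
    """One-step lookahead value of joint action a in state s."""
    return reward[(s, a)] + gamma * sum(
        p * v for p, v in zip(trans[(s, a)], value)
    )


def evaluate(policy, reward, trans, gamma, sweeps=500):
    """Iterative policy evaluation (stand-in for an approximate value)."""
    value = [0.0] * len(policy)
    for _ in range(sweeps):
        value = [
            q_value(s, policy[s], value, reward, trans, gamma)
            for s in range(len(policy))
        ]
    return value


def decentralized_improvement(policy, value, reward, trans, gamma, k):
    """Each agent in turn picks its best action with the others held fixed."""
    new_policy, evaluations = [], 0
    for s, joint_action in enumerate(policy):
        a = list(joint_action)
        for i in range(len(a)):  # assumed ordering: agent 0, 1, ..., m-1
            best_ai, best_q = a[i], None
            for cand in range(k):
                trial = tuple(a[:i] + [cand] + a[i + 1:])
                q = q_value(s, trial, value, reward, trans, gamma)
                evaluations += 1
                if best_q is None or q > best_q:
                    best_ai, best_q = cand, q
            a[i] = best_ai  # later agents see this updated choice
        new_policy.append(tuple(a))
    return new_policy, evaluations


gamma, k = 0.9, 2
reward, trans = make_mdp(n_states=4, m=3, k=k)
policy = [(0, 0, 0)] * 4
old_value = evaluate(policy, reward, trans, gamma)
new_policy, evaluations = decentralized_improvement(
    policy, old_value, reward, trans, gamma, k
)
new_value = evaluate(new_policy, reward, trans, gamma)
print(evaluations, "candidate evaluations; joint search would need", 4 * k ** 3)
```

With (near-)exact evaluation as used here, the improved policy is never worse than the original in any state, because each agent's candidate set includes its current action. Whether and how that property carries over to approximate value functions is what the paper's theorems address; this sketch does not claim it.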

Paper Structure (10 sections, 2 theorems, 15 equations, 4 figures, 2 tables, 5 algorithms)


Key Result

Theorem 1

The following inequality holds:

Figures (4)

  • Figure 1: Performance in terms of average reward and standard deviation obtained from $(a)$ Algorithm [alg:FH_PI_ALP], $(b)$ Algorithm [alg:IH_PI_ALP], for different numbers of agents on App. 2.
  • Figure 2: Performance comparisons ($m=5$) of $(a)$ Algorithm [alg:FH_PI_ALP], $(b)$ Algorithm [alg:IH_PI_ALP] with other algorithms on App. 2.
  • Figure 3: Performance in terms of average reward and standard deviation obtained from $(a)$ Algorithm 3, $(b)$ Algorithm 5 for different numbers of agents over 100,000 iterations on App. 2.
  • Figure 4: Performance comparisons ($m=5$) of $(a)$ Algorithm 3, $(b)$ Algorithm 5 with other algorithms for 100,000 iterations on App. 2.

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Remark 1
  • Theorem 2
  • proof
  • Remark 2