Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes
Lakshmi Mandal, Chandrashekar Lakshminarayanan, Shalabh Bhatnagar
TL;DR
This work addresses the computational bottleneck of policy iteration in cooperative multi-agent MDPs, where the joint action space grows exponentially with the number of agents, by introducing approximate decentralized policy iteration (ADPI) based on approximate linear programming (ALP). The proposed finite-horizon and infinite-horizon algorithms compute approximate value functions via ALP and perform decentralized policy improvement, with theoretical cost-improvement guarantees. Empirical results on standard cooperative tasks show that ADPI with ALP converges faster and performs competitively with or better than existing exact-value or fully decentralized approaches, while substantially reducing dimensionality. The contributions demonstrate scalable, provable, and practically effective decentralized planning for large multi-agent systems.
Abstract
In this work, we consider a cooperative multi-agent Markov decision process (MDP) involving m agents. At each decision epoch, all m agents independently select actions in order to maximize a common long-term objective. In policy iteration for the multi-agent setting, the number of joint actions grows exponentially with the number of agents, incurring huge computational costs. Recent works therefore consider decentralized policy improvement, where each agent improves its decisions unilaterally, assuming that the decisions of the other agents are fixed. However, these works rely on exact value functions, which are computationally expensive to obtain for a large number of agents with high-dimensional state-action spaces. We therefore propose approximate decentralized policy iteration algorithms that use approximate linear programming with function approximation to compute the approximate value function for decentralized policy improvement. We consider both finite-horizon and infinite-horizon discounted cooperative multi-agent MDPs and propose suitable algorithms in each case. Moreover, we provide theoretical guarantees for our algorithms and demonstrate their advantages over existing state-of-the-art algorithms in the literature.
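
To make the decentralized policy improvement step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a tiny tabular MDP, a value estimate V that is simply given (standing in for an ALP-based approximation), reward maximization, and agents that update one after another while the remaining agents' actions are held fixed. All names (q_value, decentralized_improvement, P, R, V, action_sets) are hypothetical and chosen for illustration only.

```python
from itertools import product


def q_value(s, joint_a, P, R, V, gamma):
    """One-step lookahead value of joint action `joint_a` in state `s`."""
    expected_next = sum(p * V[s2] for s2, p in enumerate(P[s][joint_a]))
    return R[s][joint_a] + gamma * expected_next


def decentralized_improvement(policy, P, R, V, action_sets, gamma):
    """Agent-by-agent improvement (illustrative assumption: sequential order).

    For each state, agent i searches only its own action set while the
    other agents' components of the joint action stay fixed.
    """
    new_policy = {}
    for s, joint_a in policy.items():
        joint_a = list(joint_a)
        for i, actions in enumerate(action_sets):

            def score(a_i):
                trial = joint_a[:i] + [a_i] + joint_a[i + 1:]
                return q_value(s, tuple(trial), P, R, V, gamma)

            joint_a[i] = max(actions, key=score)
        new_policy[s] = tuple(joint_a)
    return new_policy


# Toy problem (hypothetical data): 2 states, 2 agents, 2 actions each.
states = (0, 1)
action_sets = [(0, 1), (0, 1)]
joint_actions = list(product(*action_sets))

# Uniform transitions; reward favours action 1 in state 0 and action 0 in state 1.
P = {s: {a: [0.5, 0.5] for a in joint_actions} for s in states}
R = {s: {a: (sum(a) if s == 0 else -sum(a)) for a in joint_actions} for s in states}

V = [0.0, 0.0]  # stand-in for an approximate value function
policy = {s: (0, 0) for s in states}

improved = decentralized_improvement(policy, P, R, V, action_sets, gamma=0.9)
print(improved)  # {0: (1, 1), 1: (0, 0)}
```

In this sketch, each state requires one lookahead evaluation per agent per individual action (the sum of the agents' action-set sizes), whereas a search over joint actions would require the product of those sizes. With only two agents and two actions the two counts coincide, but the gap widens quickly as m grows.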
