Multi-Agent Reinforcement Learning for Multi-Cell Spectrum and Power Allocation
Yiming Zhang, Dongning Guo
TL;DR
The paper addresses scalable, low-latency radio resource allocation in dense multi-cell networks by casting the problem as a traffic-driven Dec-POMDP-IR and solving it with MAPPO using recurrent networks. It introduces two MARL-based solutions, fully distributed individual policies and a shared-policy variant, that operate on local observations and neighborhood information to minimize average packet delay via a queue-length-based reward. Empirical results show performance comparable to genie-aided centralized schemes (e.g., FP, WMMSE) with significantly lower execution times, as well as robustness across network sizes and traffic conditions. The work offers a scalable framework for decentralized spectrum and power allocation applicable to conflict graphs and cellular deployments, with potential for extension to other resource allocation problems.
Abstract
This paper introduces a novel approach to radio resource allocation in multi-cell wireless networks using a fully scalable multi-agent reinforcement learning (MARL) framework. A distributed method is developed in which agents control individual cells and determine spectrum and power allocation based on limited local information, yet achieve quality of service (QoS) performance comparable to centralized methods that use global information. The objective is to minimize packet delays across devices under stochastic arrivals, and the approach applies to both conflict graph abstractions and cellular network configurations. The problem is formulated as a distributed learning problem and solved with a multi-agent proximal policy optimization (MAPPO) algorithm that incorporates recurrent neural networks and queueing dynamics. This traffic-driven MARL-based solution enables decentralized training and execution, ensuring scalability to large networks. Extensive simulations demonstrate that the proposed methods achieve QoS performance comparable to genie-aided centralized algorithms with significantly less execution time. The trained policies also exhibit scalability and robustness across various network sizes and traffic conditions.
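To illustrate why a queue-length-based reward is a reasonable proxy for packet delay, the following is a minimal single-queue sketch in Python. It is not the paper's reward definition, which this summary does not spell out. The function name `run_queue`, the slotted FIFO model, and the reward of minus the queue length at the end of each slot are all assumptions made for illustration. Under these assumptions, the negated sum of rewards equals the total number of slots that packets spend waiting, so maximizing the cumulative reward minimizes total waiting time.

```python
from collections import deque


def run_queue(arrivals, services):
    """Simulate one FIFO queue over equally long arrival and service sequences.

    arrivals[t] is the number of packets arriving in slot t, and
    services[t] is the maximum number of packets served in slot t.
    Returns (rewards, total_wait), where rewards[t] is minus the queue
    length at the end of slot t (a hypothetical reward), and total_wait is
    the total number of slot ends that packets spent in the queue.
    """
    queue = deque()  # stores the arrival slot of each waiting packet
    rewards = []
    total_wait = 0
    for t, (num_arrivals, capacity) in enumerate(zip(arrivals, services)):
        queue.extend([t] * num_arrivals)
        for _ in range(min(capacity, len(queue))):
            arrived = queue.popleft()
            total_wait += t - arrived  # slot ends spent waiting before service
        rewards.append(-len(queue))
    # Packets still queued at the horizon waited through every slot end
    # from their arrival slot onward.
    horizon = len(rewards)
    total_wait += sum(horizon - arrived for arrived in queue)
    return rewards, total_wait


rewards, total_wait = run_queue(arrivals=[2, 1, 0, 0], services=[1, 1, 1, 1])
```

In this toy run the queue holds one packet at the end of each of the first two slots and is empty afterwards, so the cumulative reward and the total waiting time agree in magnitude. In a multi-cell setting, the service capacity of each slot would depend on the spectrum and power decisions of the agent and its neighbors, which this sketch does not model.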
