Partially Observable Multi-Agent Reinforcement Learning with Information Sharing
Xiangyu Liu, Kaiqing Zhang
TL;DR
This work studies provable multi-agent reinforcement learning (MARL) in partially observable stochastic games (POSGs) under information-sharing models, and shows that the problem is computationally intractable without information sharing and observability. It introduces an approximate common-information framework that compresses the shared history, which enables quasi-polynomial-time planning and learning of equilibria and team-optimal solutions under several natural information-sharing structures. The learning guarantees rest on constructing policy-dependent approximate models, which yield quasi-polynomial sample and time complexity and connect planning, learning, and decentralized control theory. The results show how the design of the information structure can make learning tractable in multi-agent, partially observable settings, and they point to future work on fully decentralized regimes. In practice, the results can inform the design of communication protocols and information-sharing schemes under which MARL is provably efficient.
Abstract
We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL and a standard model for multi-agent control systems with communication. We first establish several computational complexity results to justify the necessity of information-sharing, as well as of the observability assumption that has enabled single-agent RL under partial observations with quasi-polynomial time and sample complexity, for tractably solving POSGs. Motivated by the inefficiency of planning in the ground-truth model, we then propose to further \emph{approximate} the shared common information to construct an approximate model of the POSG, in which an approximate \emph{equilibrium} (of the original POSG) can be found in quasi-polynomial time under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm whose time and sample complexities are \emph{both} quasi-polynomial. Finally, beyond equilibrium learning, we extend our algorithmic framework to the more challenging goal of finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes. We establish concrete computational and sample complexities under several structural assumptions on the model. We hope our study opens up the possibility of leveraging and even designing different \emph{information structures}, a well-studied notion in control theory, for developing partially observable multi-agent RL that is both sample- and computation-efficient.
