Partially Observable Multi-Agent Reinforcement Learning with Information Sharing

Xiangyu Liu, Kaiqing Zhang

TL;DR

This work studies provable multi-agent RL (MARL) in partially observable stochastic games (POSGs) under information-sharing models. It shows that the problem is intractable unless information sharing and an observability assumption hold together. It introduces an approximate common-information framework that compresses the shared history, which enables quasi-polynomial-time planning and quasi-polynomial time and sample complexity for learning, both for equilibria and for team-optimal solutions, under several natural information-sharing structures. The framework builds policy-dependent approximate models and connects planning, learning, and decentralized control theory. The results indicate that designing the information structure can make partially observable multi-agent learning tractable, and they leave fully decentralized regimes as future work. In practice, the analysis can inform the design of communication protocols and information-sharing schemes for which provable efficiency guarantees are available.

Abstract

We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential information-sharing among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communication. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-polynomial time and sample single-agent RL with partial observations, for tractably solving POSGs. Inspired by the inefficiency of planning in the ground-truth model, we then propose to further approximate the shared common information to construct an approximate model of the POSG, in which an approximate equilibrium (of the original POSG) can be found in quasi-polynomial time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm whose time and sample complexities are both quasi-polynomial. Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the team-optimal solution in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a more challenging goal. We establish concrete computational and sample complexities under several structural assumptions of the model. We hope our study could open up the possibilities of leveraging and even designing different information structures, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.

Paper Structure

This paper contains 71 sections, 43 theorems, 81 equations, 2 figures, 1 table, 10 algorithms.

Key Result

Proposition 1

For zero-sum or cooperative POSGs with only information-sharing structures, or with only the observability assumption, but not both, computing an $\epsilon$-NE/CE/CCE is PSPACE-hard.

Figures (2)

  • Figure 1: Performance of MAPPO and IPPO in various delayed-sharing settings.
  • Figure 2: An overview of our algorithmic framework. The left part of the figure shows a virtual coordinator that collects the information shared among agents. Based on the common information $c_h$, it computes an equilibrium in the prescription space and assigns it to all the agents. The right part shows the computation of the equilibrium. Take the example of $A_i = 2$, $P_{i, h} = 3$, $C_h=2$. If we search over all deterministic prescriptions, the corresponding matrix game has size $A_i^{C_hP_{i,h}} = 64$. [nayyar2013common, nayyar2013decentralized] proposed the common information-based decomposition, which instead solves $C_h$ games of smaller size. However, in the Dec-POMDP setting, [nayyar2013decentralized] treated each deterministic prescription as an action, so each sub-problem has size $A_i^{P_{i, h}} = 8$. Furthermore, Proposition \ref{prop:linear} shows that we can reformulate each sub-problem as a game whose payoff is multi-linear with respect to each agent's prescription, and whose dimensionality is $A_iP_{i, h}=6$.
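
The counts in the Figure 2 caption can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `prescription_space_sizes` and its parameter names are hypothetical and do not come from the paper, and it assumes that $A_i$, $P_{i,h}$, and $C_h$ in the caption stand for the number of actions, private-information values, and common-information values, respectively.

```python
def prescription_space_sizes(num_actions: int, num_private: int, num_common: int) -> dict:
    """Count the per-agent search-space sizes quoted in the Figure 2 caption.

    Hypothetical helper for illustration; the argument names are assumptions
    standing in for A_i, P_{i,h}, and C_h.
    """
    return {
        # One deterministic map from every (common, private) pair to an action.
        "joint_deterministic": num_actions ** (num_common * num_private),
        # After decomposing on common information: one deterministic
        # prescription (private value -> action) per sub-problem.
        "per_subproblem_deterministic": num_actions ** num_private,
        # Multi-linear reformulation: one probability per
        # (private value, action) pair.
        "multilinear_dimension": num_actions * num_private,
    }


sizes = prescription_space_sizes(num_actions=2, num_private=3, num_common=2)
print(sizes)
```

With the caption's values the three entries evaluate to 64, 8, and 6. The first two grow exponentially in the number of private-information values, while the last grows only linearly, which is the reduction the caption attributes to the multi-linear reformulation.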

Theorems & Definitions (64)

  • Definition 1: Value function with information sharing
  • Definition 2: $\epsilon$-approximate Nash equilibrium with information sharing
  • Definition 3: $\epsilon$-approximate coarse correlated equilibrium with information sharing
  • Definition 4: $\epsilon$-approximate correlated equilibrium with information sharing
  • Definition 5: $\epsilon$-approximate team-optimum in Dec-POMDPs with information sharing
  • Example 1: One-step delayed sharing
  • Example 2: State controlled by one controller with asymmetric delay sharing
  • Example 3: Symmetric information game
  • Example 4: Information sharing with one-directional-one-step delay
  • Example 5: Uncontrolled state process
  • ...and 54 more