Principled Learning-to-Communicate with Quasi-Classical Information Structures

Xiangyu Liu; Haoyi You; Kaiqing Zhang

TL;DR

This paper formalizes learning-to-communicate (LTC) in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classifies LTC problems by their information structures (ISs) before (additional) information sharing.

Abstract

Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work, through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classify LTC problems based on the ISs before (additional) information sharing. We first show that non-classical LTCs are computationally intractable in general, and thus focus on quasi-classical (QC) LTCs. We then propose a series of conditions for QC LTCs under which the QC IS is preserved after information sharing, and whose violation can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi-polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish results on the relationship between (strictly) QC IS and the condition of having strategy-independent common-information-based beliefs (SI-CIBs), as well as on solving Dec-POMDPs without computationally intractable oracles beyond the class of those with SI-CIBs, which may be of independent interest.

Paper Structure (56 sections, 27 theorems, 90 equations, 5 figures, 2 tables, 6 algorithms)

Introduction
Contributions.
Related Work
Communication-control joint optimization.
Information sharing and information structures.
Partially observable MARL theory.
Preliminaries
Learning-to-Communicate Formulation
Decision-making components
Communication components
System evolution
Communication step:
Decision-making step:
Strategies and solution concept
Information Structures of LTC
...and 41 more sections

Key Result

Lemma III.2

For non-classical LTCs under the assumptions of $\gamma$-observability, limited communication strategy, useless action, and weak $\gamma$-observability, finding an $\frac{\epsilon}{H}$-team optimum is PSPACE-hard.
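
For orientation, the optimality notion in this lemma can be sketched as follows. This is only a sketch that assumes the usual meaning of an approximate team optimum in cooperative decision-making; the symbols $J$, $g$, $\bar{g}$, and $\mathcal{G}$ are illustrative notation introduced here and may differ from the paper's Definition II.3.

```latex
% Hypothetical notation (not taken from the paper):
%   J(g)         : the team's expected cumulative objective over horizon H under joint strategy g
%   \mathcal{G}  : the space of joint (control and communication) strategies
% A joint strategy \bar{g} is an \epsilon-team optimum if
\[
  J(\bar{g}) \;\ge\; \max_{g \in \mathcal{G}} J(g) \;-\; \epsilon .
\]
% The lemma's \frac{\epsilon}{H}-team optimum is the same condition with \epsilon replaced by \epsilon / H.
```

Under this reading, the lemma says that even approximating the best joint strategy to within $\frac{\epsilon}{H}$ is PSPACE-hard for non-classical LTCs.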

Figures (5)

Figure 1: Illustrating the subroutines of the main algorithm for solving the LTC problems.
Figure 2: (a) Venn diagram of LTCs with different ISs: ① QC LTCs. ② QC LTCs satisfying the assumptions of limited communication strategy, useless action, and weak $\gamma$-observability. ③ sQC LTCs. ④ sQC LTCs satisfying the assumptions of limited communication strategy, useless action, and weak $\gamma$-observability, whose reformulated Dec-POMDPs have SI-CIBs (and can thus be solved without computationally intractable oracles); (b) Venn diagram of general Dec-POMDPs with different ISs. PR denotes perfect recall. The paper constructs examples for each area in its section on Venn-diagram examples.
Figure 3: The time-average values achieved under different communication costs and horizons. For each bar, the dark portion and the light portion correspond to the values associated with the communication cost and the overall objective (reward minus cost) of the agents, respectively; the full bar corresponds to the values associated with the reward.
Figure 4: Learning curves, i.e., the values associated with the overall objective (reward minus cost) achieved during learning, under different communication costs.
Figure 5: Illustrating the learning-to-communicate problem considered in this paper.

Theorems & Definitions (71)

Definition II.3: $\epsilon$-team optimum
Definition II.4: Dec-POMDP (with information sharing) induced by LTC
Definition II.5: (Strictly) quasi-classical LTC
Lemma III.2: Non-classical LTCs are hard
Lemma III.3: QC LTCs with full-history-dependent communication strategies are hard
Lemma III.6: QC LTCs without the useless-action assumption are hard
Lemma III.8: QC LTCs without the weak $\gamma$-observability assumption are hard
Proposition IV.1: Equivalence between $\mathcal{L}$ and $\mathcal{D}_\mathcal{L}$
Theorem IV.2: Preserving (s)QC
Lemma IV.3
...and 61 more
