Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

John Gardiner; Orlando Romero; Brendan Tivnan; Nicolò Dal Fabbro; George J. Pappas

TL;DR

The paper tackles coordination under partial observability without communication in multi-agent reinforcement learning by leveraging quantum entanglement as a coordination resource. It introduces QuantumSoftmax to differentiably parameterize quantum measurements and a quantum-coordinator/advisor policy architecture that decouples coordination from local decision-making, integrated into a modified MAPPO algorithm. Empirically, the framework learns strategies exhibiting quantum advantage in single-round nonlocal games and in a Dec-POMDP-based multi-router queueing task, outperforming classical shared-randomness baselines in many settings. The work demonstrates a principled, gradient-based approach to incorporating quantum resources into MARL with potential practical impact as quantum hardware matures, and outlines directions for extending the framework to multi-round, imperfect hardware scenarios and more compact policy representations.
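As an illustration of what "differentiably parameterizing quantum measurements" can mean, here is a minimal sketch of one standard way to turn unconstrained parameters into a valid POVM. It is not the paper's QuantumSoftmax, whose definition is not given in this summary. The function names `povm_from_unconstrained` and `born_probabilities`, and the normalization used, are assumptions made for this example.

```python
import numpy as np

def povm_from_unconstrained(params):
    """Map unconstrained complex matrices A_k (shape [K, d, d]) to a POVM.

    Assumed construction (not taken from the paper):
        E_k = S^{-1/2} (A_k^dagger A_k) S^{-1/2},  S = sum_k A_k^dagger A_k.
    Each E_k is positive semidefinite and the E_k sum to the identity.
    """
    # grams[k] = A_k^dagger A_k, which is positive semidefinite by construction.
    grams = np.einsum("kji,kjl->kil", params.conj(), params)
    total = grams.sum(axis=0)
    # Inverse square root of the Hermitian positive-definite matrix S.
    eigvals, eigvecs = np.linalg.eigh(total)
    inv_sqrt = (eigvecs * eigvals ** -0.5) @ eigvecs.conj().T
    return inv_sqrt @ grams @ inv_sqrt

def born_probabilities(povm, rho):
    """Outcome probabilities p_k = tr(E_k rho) for a density matrix rho."""
    return np.einsum("kij,ji->k", povm, rho).real

rng = np.random.default_rng(0)
params = rng.normal(size=(3, 2, 2)) + 1j * rng.normal(size=(3, 2, 2))
povm = povm_from_unconstrained(params)

rho = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)  # pure state |0><0|
probs = born_probabilities(povm, rho)
```

Every step is a smooth function of `params` whenever `S` is invertible, so the outcome probabilities could in principle be optimized by gradient methods in an autodiff framework.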

Abstract

The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
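The quantum advantage mentioned above can be checked numerically for the CHSH game, the textbook single-round example. The sketch below is not code from the paper; it uses the standard game definition (the players win when the XOR of their answers equals the AND of their questions) and the well-known measurement angles for a shared Bell state. The function names are chosen for this example.

```python
import itertools
import numpy as np

def classical_best():
    """Best win probability over all deterministic strategies.

    Shared randomness only mixes deterministic strategies, so it cannot
    exceed this value.
    """
    best = 0.0
    for a0, a1, b0, b1 in itertools.product([0, 1], repeat=4):
        a, b = (a0, a1), (b0, b1)
        wins = sum((a[x] ^ b[y]) == (x & y) for x in (0, 1) for y in (0, 1))
        best = max(best, wins / 4)
    return best

def projectors(theta):
    """Two-outcome projective measurement in the real basis rotated by theta."""
    v0 = np.array([np.cos(theta), np.sin(theta)])
    v1 = np.array([-np.sin(theta), np.cos(theta)])
    return [np.outer(v0, v0), np.outer(v1, v1)]

def quantum_win_probability():
    """Win probability with a shared Bell state and the standard angles."""
    phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # (|00> + |11>)/sqrt(2)
    alice = [projectors(0.0), projectors(np.pi / 4)]
    bob = [projectors(np.pi / 8), projectors(-np.pi / 8)]
    total = 0.0
    for x, y, a, b in itertools.product([0, 1], repeat=4):
        if (a ^ b) == (x & y):
            joint = np.kron(alice[x][a], bob[y][b])
            total += 0.25 * (phi @ joint @ phi)  # questions are uniform
    return total

classical = classical_best()
quantum = quantum_win_probability()
```

The classical value is 3/4, while the entangled strategy reaches cos²(π/8), roughly 0.854, without any communication between the players.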

Paper Structure (45 sections, 6 theorems, 41 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Relevant Quantum Theory
Quantum systems and measurements
Joint systems and quantum entanglement
Communication, Coordination, and Entangled Policies
Learning to Coordinate via Quantum Entanglement in MARL
Parameterizing Shared Entanglement Policies
Learning Quantum Entangled Strategies
Policy Gradient for Nonlocal Games
Multi-Agent PPO for Sequential Decision-Making
Experiments
Nonlocal Games
Multi-Agent Sequential Decision-Making
Multi-Router Multi-Server Queueing
Conclusions and Future Work
...and 30 more sections

Key Result

Proposition 1

$\boldsymbol{\Pi}_{\mathsf{SR}}$ is the convex hull of $\boldsymbol{\Pi}_{\mathsf{F}}$. If $\boldsymbol{\mathcal{H}}$ and $\boldsymbol{\mathcal{A}}$ are finite, then $\boldsymbol{\Pi}_{\mathsf{F}}$ and $\boldsymbol{\Pi}_{\mathsf{SR}}$ can be represented as subsets of the Euclidean space $\mathbb{R}^{\dots}$ ...
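The convex-hull statement can be illustrated with a toy case: two agents, no observations, and two actions each. This is an assumed setup chosen for the example, not one from the paper. Mixing two factorized joint policies with a shared coin flip gives a joint policy that lies in the convex hull of factorized policies but is not itself factorized.

```python
import numpy as np

def factorized_joint(p1, p2):
    """Joint action distribution of two independent local policies."""
    return np.outer(p1, p2)

# Two deterministic factorized policies: both agents play 0, or both play 1.
both_zero = factorized_joint([1.0, 0.0], [1.0, 0.0])
both_one = factorized_joint([0.0, 1.0], [0.0, 1.0])

# Shared randomness: a fair coin seen by both agents selects which one to use.
shared = 0.5 * both_zero + 0.5 * both_one

# A joint distribution is factorized iff it equals the product of its marginals.
marginal_1 = shared.sum(axis=1)
marginal_2 = shared.sum(axis=0)
is_factorized = np.allclose(shared, np.outer(marginal_1, marginal_2))
```

Here `shared` places probability 1/2 on each of the action pairs (0, 0) and (1, 1), whereas the product of its marginals places 1/4 on every pair, so `is_factorized` is `False`.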

Figures (8)

Figure 1: Hierarchy of policies. Here, $\boldsymbol{\Pi}_{\mathsf{F}}$ is the space of factorized policies, $\boldsymbol{\Pi}_{\mathsf{SR}}$ the space of shared randomness policies, $\boldsymbol{\Pi}_{\mathsf{Q}}$ the space of shared (quantum) entanglement policies, $\boldsymbol{\Pi}_{\mathsf{NS}}$ the space of non-signaling policies, and $\boldsymbol{\Pi}_{\mathsf{C}}$ the space of all joint policies.
Figure 2: Decentralized and parameterized implementation of a joint policy with shared quantum entanglement.
Figure 3: Win probabilities of learned strategies for the CHSH game during training with and without entropy regularization.
Figure 4: The multi-router queueing problem.
Figure 5: Wait time in excess of the optimal wait time obtainable when communication is allowed. Lower is better. Learned strategies with shared quantum entanglement (orange dots) outperform the theoretical best for non-entangled strategies (blue line) for most values of the throughput.
...and 3 more figures

Theorems & Definitions (9)

Definition 1: Density matrix
Definition 2: POVM
Definition 3: Entanglement
Proposition 1
Proposition 2
Proposition 3
Proposition 4
Proposition 5
Proposition 6
