Q-MARL: A quantum-inspired algorithm using neural message passing for large-scale multi-agent reinforcement learning

Kha Vo, Chin-Teng Lin

TL;DR

Q-MARL addresses the scalability bottleneck in multi-agent reinforcement learning by fully decentralising training: the agent graph is decomposed into local sub-graphs, which are processed with neural message passing. Each agent is treated as the centre of a dynamic local neighbourhood, and its actions are ensembled across all sub-graphs that contain it, enabling efficient learning with thousands of agents without requiring common rewards or a fixed agent order. The framework provides theoretical convergence guarantees under time-varying graphs and reports dramatically faster training and lower training loss, with strong generalisation across the Jungle, Battle, and Deception scenarios compared to contemporary graph-based MARL methods. This approach offers a scalable, decentralised solution for large-scale cooperative-competitive MARL tasks, with potential impact on complex multi-agent systems and distributed decision-making.

Abstract

Inspired by a graph-based technique for predicting molecular properties in quantum chemistry -- atoms' position within molecules in three-dimensional space -- we present Q-MARL, a completely decentralised learning architecture that supports very large-scale multi-agent reinforcement learning scenarios without the need for strong assumptions like common rewards or agent order. The key is to treat each agent as relative to its surrounding agents in an environment that is presumed to change dynamically. Hence, in each time step, an agent is the centre of its own neighbourhood and also a neighbour to many other agents. Each role is formulated as a sub-graph, and each sub-graph is used as a training sample. A message-passing neural network supports full-scale vertex and edge interaction within a local neighbourhood, while a parameter governing the depth of the sub-graphs eases the training burden. During testing, an agent's actions are locally ensembled across all the sub-graphs that contain it, resulting in robust decisions. Where other approaches struggle to manage 50 agents, Q-MARL can easily marshal thousands. A detailed theoretical analysis proves improvement and convergence, and simulations with the typical collaborative and competitive scenarios show dramatically faster training speeds and reduced training losses.
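The decomposition and ensembling steps described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`k_hop_subgraph`, `decompose`, `ensemble_action`), the adjacency-list graph format, the stand-in `toy_policy`, and the choice of a plain average as the ensembling rule are all assumptions made for the example. The abstract only states that actions are "locally ensembled" and that a parameter governs sub-graph depth.

```python
from collections import deque


def k_hop_subgraph(adj, centre, depth):
    """Return the set of vertices within `depth` hops of `centre` (breadth-first search)."""
    hops = {centre: 0}
    queue = deque([centre])
    while queue:
        v = queue.popleft()
        if hops[v] == depth:  # do not expand beyond the depth limit
            continue
        for u in adj[v]:
            if u not in hops:
                hops[u] = hops[v] + 1
                queue.append(u)
    return set(hops)


def decompose(adj, depth):
    """One sub-graph per agent, each centred on that agent."""
    return {v: k_hop_subgraph(adj, v, depth) for v in adj}


def ensemble_action(agent, subgraphs, policy):
    """Average the action probabilities proposed for `agent` by every
    sub-graph that contains it (averaging is an assumed ensembling rule)."""
    votes = [
        policy(centre, members, agent)
        for centre, members in subgraphs.items()
        if agent in members
    ]
    n_actions = len(votes[0])
    avg = [sum(v[a] for v in votes) / len(votes) for a in range(n_actions)]
    best = max(range(n_actions), key=avg.__getitem__)
    return best, avg


def toy_policy(centre, members, agent):
    """Hypothetical stand-in for a trained network: returns fixed probabilities."""
    return [1.0, 0.0] if centre == agent else [0.4, 0.6]


# Five agents on a path graph: 0 - 1 - 2 - 3 - 4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
subgraphs = decompose(adj, depth=1)
action, probs = ensemble_action(2, subgraphs, toy_policy)
```

With `depth=1`, agent 2 belongs to the sub-graphs centred on agents 1, 2, and 3, so its decision combines three local views. Raising `depth` enlarges every sub-graph, which is the trade-off between context and training cost that the depth parameter controls.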

Paper Structure

This paper contains 16 sections, 3 theorems, 17 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

The gradient of the global reward $\lambda(\bm\theta)$ with respect to any local policy parameter $\bm\theta^{i}$ of agent $i$ can be derived in closed form as an expectation (the expression is not reproduced here), where $\mathbb E_\pi$ denotes the expectation of the random variables inside the brackets under the joint policy $\pi$.

Figures (9)

  • Figure 1: Inspiration for MARL from quantum chemistry. Top row: illustrations of five differently placed counterparts of the same molecule $C_4H_5ON$. The scalar coupling constant [CHAMPSKaggle] between any pair of atoms in the molecule is invariant to the placement of the molecule. Bottom row: illustration of five differently placed scenes of the same relative structure in a 4-player Halite game [Halite, Halite4]. The optimal action of each agent should be invariant with respect to the scene's rotation/symmetry.
  • Figure 2: An example of how the graph decomposition process differs for two different time steps. The blue and orange vertices represent two homogeneous groups of agents, i.e., two teams. There are 12 sub-graphs for each time step, each formed from the perspective of a different individual agent as indicated by the green vertex, and extending to its 3rd-degree neighbours. This depth of degree, and therefore the complexity of the model, is controlled by a hyperparameter.
  • Figure 3: The proposed neural message-passing (NMP) architecture. The full architecture consists of recurring vertex update (V) blocks and edge update (E) blocks. The smaller blocks (fc, rl, concat, ebd, rbe, $\cdot$, and +) are described in Section \ref{sec:architecture}.
  • Figure 4: Illustrations of the three MARL scenarios considered in this paper. From left to right: Jungle, Battle, Deception.
  • Figure 5: Illustration of typical trained behaviour in the three MARL scenarios, described in detail in the text of Section \ref{sec:result}.
  • ...and 4 more figures
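The invariance property highlighted in Figure 1 can be checked numerically on a toy scene. The sketch below assumes, purely for illustration, that each agent is described by its distances to the other agents. This is one common way to obtain rotation- and reflection-invariant inputs and is not necessarily the feature set used in the paper. The helper names (`pairwise_distances`, `rotate`, `max_abs_diff`) and the sample coordinates are hypothetical.

```python
import math


def pairwise_distances(points):
    """Matrix of Euclidean distances between every pair of 2-D points."""
    return [[math.dist(p, q) for q in points] for p in points]


def rotate(points, angle):
    """Rotate 2-D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical positions of four agents in a 2-D scene
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (-1.5, 0.5)]
rotated = rotate(points, math.radians(90))
mirrored = [(-x, y) for x, y in points]  # reflection across the y-axis

base = pairwise_distances(points)
rotation_gap = max_abs_diff(base, pairwise_distances(rotated))
mirror_gap = max_abs_diff(base, pairwise_distances(mirrored))
```

Both gaps are zero up to floating-point error even though the raw coordinates change, so any policy that reads only such relative quantities returns the same output for all placements of the same relative structure.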

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof