
Networked Agents in the Dark: Team Value Learning under Partial Observability

Guilherme S. Varela, Alberto Sardinha, Francisco S. Melo

TL;DR

This work proposes a novel cooperative multi-agent reinforcement learning (MARL) approach for networked agents. It broadens the range of applications for networked agents and is well suited to real-world domains that impose privacy constraints and where messages may not reach their recipients.

Abstract

We propose a novel cooperative multi-agent reinforcement learning (MARL) approach for networked agents. In contrast to previous methods that rely on complete state information or joint observations, our agents must learn how to reach shared objectives under partial observability. During training, they collect individual rewards and approximate a team value function through local communication, resulting in cooperative behavior. To describe our problem, we introduce the networked dynamic partially observable Markov game framework, where agents communicate over a communication network with a switching topology. Our distributed method, DNA-MARL, uses a consensus mechanism for local communication and gradient descent for local computation. DNA-MARL broadens the range of applications for networked agents and is well suited to real-world domains that impose privacy constraints and where messages may not reach their recipients. We evaluate DNA-MARL across benchmark MARL scenarios. Our results highlight the superior performance of DNA-MARL over previous methods.
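To make the idea of approximating a team quantity through local communication concrete, here is a minimal sketch of average consensus over a randomly switching topology. It is not the DNA-MARL update itself, which this page does not specify: the Metropolis weighting, the edge probability `p`, and the function names `metropolis_step`, `random_topology`, and `team_estimate` are all assumptions made for illustration. An edge that is absent in a given round stands in for a message that did not reach its recipient.

```python
import random


def metropolis_step(values, edges):
    """One consensus round: each agent mixes its value with its current neighbors'."""
    n = len(values)
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    new_values = list(values)
    for i, j in edges:
        # Metropolis weight (an assumed choice): symmetric, so the sum is preserved.
        w = 1.0 / (1 + max(degree[i], degree[j]))
        new_values[i] += w * (values[j] - values[i])
        new_values[j] += w * (values[i] - values[j])
    return new_values


def random_topology(n, p, rng):
    """Sample an undirected graph; a missing edge models a dropped message."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def team_estimate(local_rewards, rounds=200, p=0.5, seed=0):
    """Each agent starts from its own reward and only ever talks to neighbors."""
    rng = random.Random(seed)
    values = list(local_rewards)
    for _ in range(rounds):
        edges = random_topology(len(values), p, rng)
        values = metropolis_step(values, edges)
    return values


if __name__ == "__main__":
    # Hypothetical individual rewards for four agents.
    print(team_estimate([1.0, 0.0, 3.0, 0.0]))
```

Because the mixing weights are symmetric, the average of the agents' values is unchanged in every round, so each agent's estimate drifts toward the team average without any agent seeing another's raw reward beyond its current neighbors.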

Paper Structure

This paper contains 25 sections, 36 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: Diagram illustrating the information flow from the algorithm, clockwise from the left.
  • Figure 2: From left to right, episodic returns for the on-policy setting and episodic returns for the off-policy setting, for two selected tasks. We plot the 95% bootstrap CI for each algorithm. In chestnut, CTDE algorithms that establish the upper bound on performance. In grey, DTDE algorithms that are DNA-MARL's closest competitors. In orange, IL algorithms that establish a lower bound on performance. For three algorithm-environment combinations, (a), (b), (c), DNA-MARL (in blue) has the closest performance to the upper bound.
  • Figure 3: Ablation for DNAA2C. From left to right: the DV (distributed-$V$) group has critic consensus; the TV (team-$V$) group has team-$V$ consensus and critic consensus; the DNA group has team-$V$ consensus and both actor and critic consensus. Performance improves when moving from DV to TV, which highlights the impact of our contribution.
  • Figure 4: Training rollouts for the on-policy algorithms. The bullets represent the average of evaluation checkpoints for ten random seeds. The shaded area represents a 95% bootstrap confidence interval around the average. In blue, DNAA2C (ours); in orange, INDA2C, the independent agents system. In gray, DVA2C, a distributed-$V$ algorithm (zhang_2018). In chestnut, MAA2C, the central-$V$ algorithm, which acts as an upper bound on performance. For the four instances, DNAA2C provides the best approximation to MAA2C.
  • Figure 5: Training rollouts for the off-policy algorithms. The bullets represent the average of evaluation checkpoints for ten random seeds. The shaded area represents a 95% bootstrap confidence interval around the average. In blue, DNAQL (ours); in orange, INDQL, the independent agents system. In gray, PIC, a central-$Q$ algorithm that approximates chen_2022. In chestnut, VDN, which uses factorized representations for a central $Q$. For the LBF scenarios, DNAQL outperforms other decentralized approaches. For the MPE scenarios, (d) and (e), it outperforms VDN.
  • ...and 2 more figures
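
The captions above report 95% bootstrap confidence intervals around the average over ten random seeds. The exact bootstrap variant is not stated on this page; the sketch below assumes the simple percentile bootstrap of the mean, and the function name `bootstrap_ci`, the number of resamples, and the example returns are hypothetical.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant, for illustration)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical episodic returns at one checkpoint, one value per seed.
    returns = [0.61, 0.58, 0.66, 0.70, 0.55, 0.63, 0.59, 0.68, 0.64, 0.60]
    print(statistics.fmean(returns), bootstrap_ci(returns))
```

With only ten seeds the interval is driven by a handful of values, which is why the shaded bands in such plots can be wide even when the averages look well separated.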