Distributed Value Decomposition Networks with Networked Agents

Guilherme S. Varela, Alberto Sardinha, Francisco S. Melo

TL;DR

This paper tackles learning in cooperative multi-agent systems under partial observability without centralized training. It introduces DVDN, a decentralized approach that decomposes a joint Q-function into agent-wise components and uses peer-to-peer consensus on temporal difference (TD) errors to approximate the joint temporal difference (JTD); for homogeneous agents, gradient tracking (GT) is added to align parameters and gradients. Empirical results across ten MARL tasks with decentralized training and decentralized execution (DTDE) show that DVDN can match or exceed the performance of centralized VDN in many heterogeneous scenarios, with JTD consensus providing robust gains and gradient tracking offering additional benefits in specific settings. The work advances practical distributed MARL by enabling effective learning through local communication and consensus while addressing the non-stationarity challenges of decentralized training, with potential impact on real-world multi-robot and networked systems.
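
The decomposition summarized above can be written out as a short derivation. The notation is assumed for illustration and is not taken from the paper: \(o_i\), \(a_i\) and \(\theta_i\) are the observation, action and parameters of agent \(i\), \(N\) is the number of agents, and target networks are omitted. The last step additionally assumes that the joint reward is the sum of per-agent rewards, \(r = \sum_i r_i\); under that assumption the joint TD error is the sum of local TD errors, which is the kind of quantity a network-wide averaging step can estimate.

```latex
% Additive value decomposition (as in VDN)
Q_{\mathrm{tot}}(\mathbf{o}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(o_i, a_i; \theta_i)

% Joint TD error; the max over joint actions splits because Q_tot is a sum
\delta = r + \gamma \max_{\mathbf{a}'} Q_{\mathrm{tot}}(\mathbf{o}', \mathbf{a}') - Q_{\mathrm{tot}}(\mathbf{o}, \mathbf{a})
       = r + \gamma \sum_{i=1}^{N} \max_{a_i'} Q_i(o_i', a_i'; \theta_i) - \sum_{i=1}^{N} Q_i(o_i, a_i; \theta_i)

% Assuming r = \sum_i r_i, the joint TD error is a sum of local TD errors
\delta = \sum_{i=1}^{N} \delta_i,
\qquad
\delta_i = r_i + \gamma \max_{a_i'} Q_i(o_i', a_i'; \theta_i) - Q_i(o_i, a_i; \theta_i)
```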

Abstract

We investigate the problem of distributed training under partial observability, whereby cooperative multi-agent reinforcement learning (MARL) agents maximize the expected cumulative joint reward. We propose distributed value decomposition networks (DVDN) that generate a joint Q-function that factorizes into agent-wise Q-functions. Whereas the original value decomposition networks rely on centralized training, our approach is suitable for domains where centralized training is not possible and agents must learn by interacting with the physical environment in a decentralized manner while communicating with their peers. DVDN overcomes the need for centralized training by locally estimating the shared objective. We contribute two innovative algorithms, DVDN and DVDN (GT), for the heterogeneous and homogeneous agent settings, respectively. Empirically, both algorithms approximate the performance of value decomposition networks, in spite of the information loss during communication, as demonstrated in ten MARL tasks in three standard environments.
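
To make "locally estimating the shared objective" concrete, here is a minimal sketch of average consensus over local TD errors. It is an illustration under stated assumptions, not the authors' implementation: the function names `metropolis_weights` and `consensus_joint_td`, the undirected 4-agent ring, the Metropolis mixing weights, and the numeric TD values are all hypothetical, and the sketch assumes the joint TD error is the sum of the agents' local TD errors.

```python
import numpy as np


def metropolis_weights(adjacency):
    """Build a doubly stochastic mixing matrix from an undirected adjacency matrix.

    Assumption: Metropolis weights are used here for illustration only.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j] > 0:
                weights[i, j] = 1.0 / (1.0 + max(degrees[i], degrees[j]))
        # The self-weight absorbs the remainder so that each row sums to one.
        weights[i, i] = 1.0 - weights[i].sum()
    return weights


def consensus_joint_td(local_td, weights, rounds):
    """Return each agent's estimate of the summed TD error after `rounds` exchanges."""
    estimates = np.asarray(local_td, dtype=float).copy()
    for _ in range(rounds):
        # Each agent replaces its value with a weighted average over itself
        # and its neighbours; only peer-to-peer information is used.
        estimates = weights @ estimates
    # The network average times the number of agents approximates the sum.
    return len(estimates) * estimates


# Hypothetical example: 4 agents on an undirected ring.
adjacency = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])
local_td = np.array([0.5, -0.2, 0.1, 0.4])  # made-up local TD errors

weights = metropolis_weights(adjacency)
joint_td_estimates = consensus_joint_td(local_td, weights, rounds=50)
print(joint_td_estimates)
```

With enough rounds, every agent's estimate approaches the sum of the local TD errors. With only a few rounds the estimates differ between agents, which is one way to picture the information loss during communication that the abstract mentions.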

Paper Structure

This paper contains 27 sections, 36 equations, 16 figures, 7 tables, 2 algorithms.

Figures (16)

  • Figure 1: Each column shows the results for one environment, represented by a single task. The top row shows the heterogeneous agents setting and the bottom row the homogeneous agents setting. The IQL curve is orange, the VDN curve is chestnut, and the DVDN curve is blue. The markers represent the evaluation checkpoints and the shaded areas represent the 95% bootstrap CIs. Notably, the performance curves for VDN and DVDN are similar, showcasing the effectiveness of using JTD as a training signal.
  • Figure 2: Ablation plots for the homogeneous setting, for the LBF and MARBLER environments respectively. The IQL group has no consensus (control group). The GT group performs gradient tracking. The JTD group performs joint temporal difference consensus. The GT+JTD group combines GT and JTD consensus. Across the three tasks, each factor improves results individually, and combining them works best.
  • Figure 3: Left: the value decomposition diagram, where the value decomposition layer performs addition. Right: temporal difference consensus, where agents use peer-to-peer communication over an arbitrary strongly connected graph. Bidirectional arrows indicate the last step of the forward pass (a) or TD consensus (b). The dashed lines represent back-propagation.
  • Figure 4: The remaining performance plots of the algorithms for heterogeneous agents in the LBF environment, with IQL represented in orange, VDN in chestnut, and DVDN in blue. The markers represent the average episodic returns and the shaded areas represent the 95% bootstrap CIs. All algorithms have about the same performance. In particular, DVDN's learning curve is similar to VDN's.
  • Figure 5: Ablation plots for heterogeneous agents in the LBF environment. The IQL (control) group has no consensus, while the JTD group has joint temporal difference consensus. Notably, JTD consensus leads to significant improvement in results across tasks.
  • ...and 11 more figures