Fully-Decentralized MADDPG with Networked Agents

Diego Bolliger, Lorenz Zauter, Robert Ziegler

TL;DR

This work tackles fully decentralized multi-agent reinforcement learning (MARL) in partially observable stochastic games (POSGs) by adapting MADDPG to use only local information during both training and execution. It first introduces a fully decentralized MADDPG that uses surrogate policies to approximate other agents' behavior, together with local replay buffers. It then adds a networked training paradigm with either hard consensus (averaging neighbors' critics) or soft consensus (penalized parameter alignment) to balance cooperation and decentralization. The authors extend these methods to adversarial and mixed settings by adjusting the gradient updates to account for potential adversaries and for observations of their actions. Empirical results in the multi-agent particle environment show that the decentralized variants achieve performance comparable to MADDPG at lower computational cost. Soft consensus offers better stability, especially as the number of agents grows, and converges faster than fully centralized approaches in larger-scale scenarios. These findings demonstrate scalable decentralized MARL with local observations and point to future work on applying the approach to other algorithms such as MAPPO to further improve performance in large, networked teams.
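The difference between the two consensus variants can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: each critic is flattened to a parameter vector, hard consensus is taken to be a plain average over an agent and its neighbors, and soft consensus is taken to be a quadratic penalty on parameter differences. The function names and the `coeff` parameter are hypothetical.

```python
import numpy as np

def hard_consensus(params, adjacency):
    """Replace each agent's critic parameters by the mean over itself and its neighbors.

    params: (n_agents, n_params) array, one flattened critic per row (assumed layout).
    adjacency: (n_agents, n_agents) 0/1 matrix of the communication graph, zero diagonal.
    """
    params = np.asarray(params, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)
    weights = adjacency + np.eye(len(params))       # include the agent itself
    weights /= weights.sum(axis=1, keepdims=True)   # row-normalize into an average
    return weights @ params

def soft_consensus_penalty_grad(params, adjacency, coeff):
    """Gradient w.r.t. w_i of (coeff / 2) * sum_{j in N(i)} ||w_i - w_j||^2.

    Neighbor parameters w_j are treated as constants, so each agent only
    needs the parameters its neighbors send it. The result would be added
    to the agent's ordinary critic-loss gradient.
    """
    params = np.asarray(params, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=1, keepdims=True)
    return coeff * (degree * params - adjacency @ params)
```

In this sketch hard consensus overwrites the local critic at every consensus step, whereas the soft variant only pulls neighboring critics toward each other with a strength set by `coeff`, so local critics can still differ.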

Abstract

In this paper, we devise three actor-critic algorithms with decentralized training for multi-agent reinforcement learning in cooperative, adversarial, and mixed settings with continuous action spaces. To this end, we adapt the MADDPG algorithm by applying a networked communication approach between agents. We introduce surrogate policies in order to decentralize the training while allowing for local communication during training. The decentralized algorithms achieve comparable results to the original MADDPG in empirical tests, while reducing computational cost. This reduction is more pronounced with larger numbers of agents.

Paper Structure

This paper contains 27 sections, 1 theorem, 20 equations, 6 figures, 7 algorithms.

Key Result

Theorem 1

Let $J(\theta_i) = \mathbb{E}_{s\sim P, a\sim\pi_{\theta_i}}[R^i]$ be the expected reward of each agent $i$. Then we can write the gradient of the policy as: where $\textbf{x} = (o_1,\ldots, o_{N})$ are the observations of all agents, $Q_i^\pi(\textbf{x}, a_1,\ldots, a_{N})$ is a centralized action-value function, $\pi_{\theta_i}(a_i|o_i)$ is the parameterized policy of agent $i$ utilizing only loca
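
For reference, the multi-agent policy gradient of the original MADDPG formulation is commonly written in the form below. It is given here on the assumption that the theorem restates that standard result; the expectation subscripts are copied from the definition of $J(\theta_i)$ above and may not match the authors' exact display.

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s\sim P, a\sim\pi_{\theta_i}}\left[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i|o_i)\, Q_i^\pi(\textbf{x}, a_1,\ldots, a_{N})\right]$$

Each agent's policy sees only its own observation $o_i$, while the critic $Q_i^\pi$ is evaluated on the observations and actions of all agents.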

Figures (6)

  • Figure 1: Comparison of evaluation score averaged over 100 test episodes for the simple spread environment between standard MADDPG, the fully decentralized MADDPG, and the fully decentralized MADDPG with hard and soft consensus update.
  • Figure 2: Comparison of evaluation scores with four and five agents for the simple spread environment. Training was done for four agents over $400000$ steps and for five agents over $500000$ steps.
  • Figure 3: Comparison of evaluation scores with less connectivity over training steps for the simple spread environment between standard MADDPG (blue), the fully decentralized MADDPG (green), and the fully decentralized MADDPG with hard (red) and soft (orange) consensus update. Training was done for two agents over $100000$ steps and for three agents over $300000$ steps.
  • Figure 4: Comparison of evaluation scores with ten agents for the simple spread environment with a fully connected communication graph and with a circle graph.
  • Figure 5: Performance of the algorithms in an adversarial setting of one agent against one. The two agents are shown in separate plots.
  • ...and 1 more figure
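
The two communication topologies named in the Figure 4 caption can be written as adjacency matrices. The sketch below is only an illustration: it assumes undirected graphs, takes "circle" to mean a ring in which each agent talks to its two adjacent agents, and uses hypothetical helper names.

```python
import numpy as np

def fully_connected_graph(n_agents):
    """Adjacency matrix in which every agent communicates with every other agent."""
    return np.ones((n_agents, n_agents), dtype=int) - np.eye(n_agents, dtype=int)

def circle_graph(n_agents):
    """Adjacency matrix of a ring: agent i communicates with i-1 and i+1 (mod n)."""
    adjacency = np.zeros((n_agents, n_agents), dtype=int)
    for i in range(n_agents):
        adjacency[i, (i + 1) % n_agents] = 1
        adjacency[i, (i - 1) % n_agents] = 1
    return adjacency

def neighbors(adjacency, agent):
    """Indices of the agents that `agent` exchanges parameters with."""
    return np.flatnonzero(adjacency[agent]).tolist()
```

With ten agents, each agent has nine neighbors in the fully connected graph but only two in the ring, so the amount of parameter exchange per consensus step differs substantially between the two settings.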

Theorems & Definitions (3)

  • Definition 1: POSG
  • Definition 2: POSG with networked agents
  • Theorem 1: Policy gradient theorem