GAWM: Global-Aware World Model for Multi-Agent Reinforcement Learning

Zifeng Shi, Meiqin Liu, Senlin Zhang, Ronghao Zheng, Shanling Dong, Ping Wei

TL;DR

GAWM is a model-based MARL method that enhances the centralized world model's ability to form a globally unified and accurate representation of state information while adhering to the CTDE paradigm. This leads to better convergence performance, particularly in complex and challenging multi-agent environments.

Abstract

In recent years, model-based Multi-Agent Reinforcement Learning (MARL) has demonstrated significant sample-efficiency advantages over model-free methods by using independent world models of the environment dynamics to augment the training data. However, when the sample budget is not the limiting factor, these methods still lag behind model-free methods in final convergence performance and stability. This is primarily due to the world model's insufficient and unstable representation of global states in partially observable environments. This limitation makes it hard to ensure global consistency in the data samples and produces a time-varying, unstable distribution mismatch between the pseudo samples generated by the world model and the real samples. The issue becomes particularly pronounced in more complex multi-agent environments. To address this challenge, we propose GAWM, a model-based MARL method that enhances the centralized world model's ability to form a globally unified and accurate representation of state information while adhering to the centralized training with decentralized execution (CTDE) paradigm. GAWM uses an additional Transformer architecture to fuse local observation information from different agents, improving its ability to extract and represent global state information. This improves both sample efficiency and training stability, leading to better convergence performance and allowing model-based methods to be applied effectively to more complex and challenging multi-agent environments. Experimental results show that GAWM outperforms various model-free and model-based approaches in the challenging domains of SMAC.
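To make the idea of fusing local observations across agents concrete, the following is a minimal sketch of single-head self-attention over per-agent observation vectors. It is an illustration under stated assumptions, not GAWM's actual architecture: the function names, the dimensions, the random weights, and the use of a single attention head are all hypothetical, and the paper's Transformer may differ in every one of these respects.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Hypothetical weight initialisation; real weights would be learned."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def fuse_observations(obs, Wq, Wk, Wv):
    """Single-head self-attention across agents.

    obs: one feature vector per agent (local observation encodings).
    Returns one fused vector per agent, plus the attention weights.
    """
    queries = [matvec(Wq, o) for o in obs]
    keys = [matvec(Wk, o) for o in obs]
    values = [matvec(Wv, o) for o in obs]
    d_k = len(keys[0])
    fused, attn = [], []
    for q in queries:
        # Scaled dot-product score of this agent's query against every agent's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of all agents' values: each output mixes information
        # from every agent's local observation.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
        attn.append(weights)
    return fused, attn


# Assumed sizes, chosen only for illustration.
rng = random.Random(0)
n_agents, obs_dim, d_model = 3, 4, 8
obs = [[rng.uniform(-1, 1) for _ in range(obs_dim)] for _ in range(n_agents)]
Wq = random_matrix(d_model, obs_dim, rng)
Wk = random_matrix(d_model, obs_dim, rng)
Wv = random_matrix(d_model, obs_dim, rng)

fused, attn = fuse_observations(obs, Wq, Wk, Wv)
```

Each row of `attn` sums to 1, so every fused vector is a convex combination of all agents' value vectors. This is the sense in which attention lets a per-agent representation draw on observations held by other agents during centralized training.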

Paper Structure

This paper contains 28 sections, 6 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: The current mainstream world models adopt a centralized state-transition prediction and distributed state-reconstruction framework. In this approach, the inputs for state-transition prediction include global latent state variables and action information, while the current state representation and reconstruction rely solely on locally observable state information. Due to the inherent limitations of partial observability, each agent's local observations provide only a fragmented view of the global state, making it difficult to accurately predict and represent global information. Consequently, this limitation may lead to inconsistencies in the reconstructed state information (e.g., rewards, observations, and discount factors) and cause conflicts in global consistency.
  • Figure 2: Dual Experience Replay Buffer structure.
  • Figure 3: Comparisons with other baselines. The solid line represents the running average over 3 different random seeds, and the shaded area corresponds to the range of win rates/episode rewards across seeds at the same step. The X-axis represents the number of steps taken in the real environment, and the Y-axis represents the win rate (SMAC).
  • Figure 4: Training loss curve for the world model. The solid line represents the running average over 3 different random seeds, and the shaded area corresponds to the range of loss values across seeds at the same epoch. The X-axis represents the number of training epochs of the world model, and the Y-axis represents the loss value.
  • Figure 5: Win rate curve for ablation experiments. The solid line represents the running average over 3 different random seeds, and the shaded area corresponds to the range of win rates across seeds at the same step. The X-axis represents the number of steps taken in the real environment, and the Y-axis represents the win rate (SMAC).
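
The captions for Figures 3 to 5 describe a solid line averaged over seeds and a shaded band spanning the per-seed range. A minimal sketch of that aggregation is shown below, assuming the solid line is the per-step mean across seeds followed by a trailing running average; the exact smoothing used in the paper is not stated, and the win-rate numbers here are made up purely for illustration.

```python
def aggregate_seeds(curves):
    """Per-step mean, minimum, and maximum across seeds.

    curves: one equal-length list of metric values per random seed.
    """
    n = len(curves)
    mean = [sum(step) / n for step in zip(*curves)]
    low = [min(step) for step in zip(*curves)]
    high = [max(step) for step in zip(*curves)]
    return mean, low, high


def running_average(values, window):
    """Trailing running average; shorter windows are used at the start."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Illustrative (not real) win rates for 3 seeds at 3 evaluation points.
curves = [
    [0.0, 0.2, 0.5],
    [0.1, 0.3, 0.6],
    [0.2, 0.4, 0.7],
]

mean, low, high = aggregate_seeds(curves)
solid_line = running_average(mean, window=2)  # assumed smoothing window
```

Here `solid_line` would be drawn as the curve, and `low` and `high` would bound the shaded area at each evaluation point.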