MA2RL: Masked Autoencoders for Generalizable Multi-Agent Reinforcement Learning

Jinyuan Feng, Min Chen, Zhiqiang Pu, Yifan Xu, Yanyan Liang

TL;DR

MA2RL addresses generalization in decentralized partially observable MARL by extending masked autoencoders to an entity-centric framework. It employs two variational autoencoders to encode observed entities and global states, dynamically infers latent representations of masked entities, and uses an attentive action decoder with a skill token to produce actions, achieving strong zero-shot and transfer performance. The method yields state-of-the-art asymptotic results on challenging tasks, improved sample efficiency, and robust generalization across single-task and multi-task settings. This framework offers a scalable path toward generalizable coordination in multi-agent systems, with potential impact on real-world cooperative robotics and autonomous networks.

Abstract

To develop generalizable models in multi-agent reinforcement learning, recent approaches have been devoted to discovering task-independent skills for each agent, which generalize across tasks and facilitate agents' cooperation. However, particularly in partially observed settings, such approaches struggle with sample efficiency and generalization capabilities due to two primary challenges: (a) How to incorporate global states into coordinating the skills of different agents? (b) How to learn generalizable and consistent skill semantics when each agent only receives partial observations? To address these challenges, we propose a framework called Masked Autoencoders for Multi-Agent Reinforcement Learning (MA2RL), which encourages agents to infer unobserved entities by reconstructing entity-states from the entity perspective. The entity perspective helps MA2RL generalize to diverse tasks with varying agent numbers and action spaces. Specifically, we treat local entity-observations as masked contexts of the global entity-states, and MA2RL can infer the latent representation of dynamically masked entities, facilitating the assignment of task-independent skills and the learning of skill semantics. Extensive experiments demonstrate that MA2RL achieves significant improvements relative to state-of-the-art approaches, demonstrating extraordinary performance, remarkable zero-shot generalization capabilities and advantageous transferability.
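
The idea of treating a local entity-observation as a masked context of the global entity-states can be illustrated with a minimal sketch. It is not the paper's implementation: the `(x, y, health)` entity layout, the circular `sight_range`, the zero placeholder for hidden entities, and the function name `mask_entity_states` are all assumptions made for illustration, and the VAE encoders and the action decoder are omitted.

```python
import math

def mask_entity_states(entity_states, agent_index, sight_range):
    """Return (observation, mask) for one agent.

    entity_states: list of (x, y, health) tuples, one per entity
                   (hypothetical layout, not taken from the paper).
    mask[j] is True when entity j is hidden from the agent, i.e. "masked".
    """
    ax, ay, _ = entity_states[agent_index]
    observation, mask = [], []
    for x, y, health in entity_states:
        # Assumption: an entity is visible iff it lies within sight_range.
        hidden = math.hypot(x - ax, y - ay) > sight_range
        mask.append(hidden)
        # Hidden entities are replaced by a zero placeholder (a stand-in mask token).
        observation.append((0.0, 0.0, 0.0) if hidden else (x, y, health))
    return observation, mask

# Three entities in a shared global state; each agent sees a different subset.
states = [(0.0, 0.0, 1.0), (3.0, 4.0, 0.5), (10.0, 0.0, 0.8)]
obs0, mask0 = mask_entity_states(states, agent_index=0, sight_range=6.0)
obs2, mask2 = mask_entity_states(states, agent_index=2, sight_range=6.0)
print(mask0)  # which entities agent 0 cannot see
print(mask2)  # which entities agent 2 cannot see
```

In this sketch the mask is determined by the agents' positions rather than sampled at random, so it changes as entities move, which is one way to read "dynamically masked entities". A reconstruction model would then be trained to recover the hidden entries from the visible ones.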

Paper Structure

This paper contains 18 sections, 12 equations, 11 figures, 4 tables, 3 algorithms.

Figures (11)

  • Figure 1: Illustration of MAE in different domains. (a) Masked autoencoding in NLP. (b) Masked autoencoding in CV. (c) Masked autoencoding in the context of generalization in MARL.
  • Figure 2: Simplified schematic of typical methods for applying MAE in multi-agent systems, comparing the masking strategies of MA2CL, MaskMA, and MA2RL. (a) Random masking in MA2CL. (b) Random masking in MaskMA. (c) Dynamic masking in MA2RL.
  • Figure 3: The network structure of MA2RL. (a) The overall architecture. (b) The structure of the variational autoencoder (VAE). (c) The details of the masked autoencoder for MARL, where entity-observations can be regarded as masked contexts of the entity-states. (d) The attentive action decoder, which reuses the decoder in the VAE to infer masked entity-observations for better action execution.
  • Figure 4: Attentive action decoder. The decoder uses the latent representations of all masked entity-observations to infer information about the masked entities and then applies a skill attention module to obtain actions.
  • Figure 5: Performance of MA2RL and the baselines (DT2GS, UPDeT, ASN_G, and MAPPO) in the single-task setting. The evaluation covers two hard tasks (5m_vs_6m, 3s_vs_5z) and two super-hard tasks (3s5z_vs_3s6z, 6h_vs_8z).
  • ...and 6 more figures