Representation Learning For Efficient Deep Multi-Agent Reinforcement Learning
Dom Huh, Prasant Mohapatra
TL;DR
The paper targets sample inefficiency in deep multi-agent reinforcement learning (MARL) with MAPO-LSO, a framework that adds multi-agent latent-space optimization (MA-LSO) to MARL training. MA-LSO has two parts: MA-TDR (multi-agent transition dynamics reconstruction), which reconstructs multi-agent transition dynamics using recurrent modeling and predictive representations, and MA-SPL (multi-agent self-predictive learning), which enforces self-predictive consistency across agents through masked reconstruction, forward and inverse dynamics, and dedicated MLP heads. The approach can be integrated with existing MARL algorithms (e.g., MAPPO, HAPPO, MASAC, MADDPG) with minimal changes and shows substantial gains in convergence speed and data efficiency on the VMAS and IsaacTeams benchmarks. Ablations show that all components contribute and are interdependent, and that pre-training and uncertainty modeling further improve performance. Overall, MAPO-LSO offers a scalable, general-purpose enhancement to MARL that enriches the latent state space with dynamics-aware, jointly consistent representations.
Abstract
Sample efficiency remains a key challenge in multi-agent reinforcement learning (MARL). A promising approach is to learn a meaningful latent representation space through auxiliary learning objectives alongside the MARL objective to aid in learning a successful control policy. In our work, we present MAPO-LSO (Multi-Agent Policy Optimization with Latent Space Optimization), which applies a form of comprehensive representation learning devised to supplement MARL training. Specifically, MAPO-LSO proposes a multi-agent extension of transition dynamics reconstruction and self-predictive learning, yielding a latent state optimization scheme that can be trivially extended to current state-of-the-art MARL algorithms. Empirical results show that MAPO-LSO notably improves sample efficiency and learning performance over its vanilla MARL counterpart, without any additional MARL hyperparameter tuning, on a diverse suite of MARL tasks.
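To make the idea of auxiliary objectives trained alongside the MARL objective concrete, the following is a minimal NumPy sketch and not the paper's implementation. It computes three illustrative auxiliary terms on a shared latent space (a forward-dynamics term, an inverse-dynamics term, and a next-observation reconstruction term) and adds them to a placeholder MARL loss. The single linear maps standing in for MLP heads, the choice of cosine and mean-squared-error losses, the loss weights, and all names (W_enc, W_fwd, W_inv, W_dec, auxiliary_losses, total_loss) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Hypothetical stand-in for an MLP head: one random linear map."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

def cosine_loss(pred, target):
    """Mean of (1 - cosine similarity) over all agents and batch entries."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))

# Toy dimensions (assumed): 3 agents, batch of 8 transitions.
n_agents, batch, obs_dim, act_dim, latent_dim = 3, 8, 10, 2, 16
obs = rng.normal(size=(n_agents, batch, obs_dim))
act = rng.normal(size=(n_agents, batch, act_dim))
next_obs = rng.normal(size=(n_agents, batch, obs_dim))

W_enc = linear(obs_dim, latent_dim)               # shared observation encoder
W_fwd = linear(latent_dim + act_dim, latent_dim)  # forward-dynamics head
W_inv = linear(2 * latent_dim, act_dim)           # inverse-dynamics head
W_dec = linear(latent_dim + act_dim, obs_dim)     # next-observation decoder

def auxiliary_losses(obs, act, next_obs):
    """Return illustrative latent-space losses for one batch of transitions."""
    z = np.tanh(obs @ W_enc)            # latent state per agent
    z_next = np.tanh(next_obs @ W_enc)  # target latent (no gradient in a real setup)
    z_act = np.concatenate([z, act], axis=-1)
    return {
        # Predict the next latent from the current latent and action.
        "forward": cosine_loss(z_act @ W_fwd, z_next),
        # Recover the action from consecutive latents.
        "inverse": mse(np.concatenate([z, z_next], axis=-1) @ W_inv, act),
        # Reconstruct the next observation from the latent and action.
        "reconstruction": mse(z_act @ W_dec, next_obs),
    }

def total_loss(marl_loss, aux, weights):
    """MARL loss plus a weighted sum of the auxiliary terms."""
    return marl_loss + sum(weights[k] * aux[k] for k in aux)

aux = auxiliary_losses(obs, act, next_obs)
weights = {"forward": 1.0, "inverse": 0.5, "reconstruction": 0.5}  # assumed values
loss = total_loss(marl_loss=1.5, aux=aux, weights=weights)
print(aux, loss)
```

In a real training loop the weighted sum would be minimized with an automatic-differentiation framework, and the MARL term would come from whichever base algorithm is in use (for example MAPPO); the sketch only shows how the terms combine.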
