MARLadona -- Towards Cooperative Team Play Using Multi-Agent Reinforcement Learning

Zichong Li, Filip Bjelonic, Victor Klemm, Marco Hutter

TL;DR

The paper addresses the challenge of learning cooperative policies in multi-agent robot soccer by introducing MARLadona, a decentralized MARL framework built on a physics-based Isaac Gym environment. It combines a permutation-aware global entity encoder with end-to-end training and curricula, enabling scalable team play up to 11v11 through self-play and CTDE. The framework achieves a 66.8% win rate against the HELIOS benchmark and demonstrates strong generalization to larger team sizes, accompanied by detailed policy analyses and qualitative behavior insights. This work advances end-to-end MARL in physically grounded multi-agent soccer and provides an open-source environment and architecture to accelerate research and benchmarking.

Abstract

Robot soccer, in its full complexity, poses an unsolved research challenge. Current solutions rely heavily on engineered heuristic strategies, which lack robustness and adaptability. Deep reinforcement learning has gained significant traction in various complex robotics tasks such as locomotion, manipulation, and competitive games (e.g., AlphaZero, OpenAI Five), making it a promising solution to the robot soccer problem. This paper introduces MARLadona, a decentralized multi-agent reinforcement learning (MARL) training pipeline capable of producing agents with sophisticated team play behavior, overcoming the shortcomings of heuristic methods. Furthermore, we created an open-source multi-agent soccer environment. Using our MARL framework and a modified global entity encoder (GEE) as our core architecture, our approach achieves a 66.8% win rate against the HELIOS agent, which employs a state-of-the-art heuristic strategy. In addition, we provided an in-depth analysis of the policy behavior and interpreted the agent's intention using the critic network.
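
To make the idea of a permutation-aware entity encoder more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each entity (for example a teammate) is a short feature vector, that one set of weights is shared across all entities of a type, and that the embeddings are combined by max pooling; the pooling operator, layer sizes, and all names here are illustrative assumptions. Random weights stand in for trained parameters.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Build a random dense layer; a stand-in for trained shared weights."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, biases


def dense_tanh(layer, x):
    """Apply one dense layer followed by tanh to a single feature vector."""
    weights, biases = layer
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def encode_entities(layer, entities):
    """Embed every entity with the same layer, then max-pool per feature.

    Max pooling is an assumption made for this sketch. It makes the result
    independent of entity order and of the number of entities.
    """
    embedded = [dense_tanh(layer, e) for e in entities]
    return [max(column) for column in zip(*embedded)]


rng = random.Random(0)
teammate_layer = make_layer(n_in=4, n_out=8, rng=rng)

# Hypothetical teammate observations: (x, y, vx, vy) in the ego frame.
teammates = [
    [0.5, -1.0, 0.1, 0.0],
    [2.0, 0.3, -0.2, 0.4],
    [-1.5, 1.2, 0.0, -0.1],
]
pooled = encode_entities(teammate_layer, teammates)

# Reordering the teammates leaves the pooled embedding unchanged.
shuffled = [teammates[2], teammates[0], teammates[1]]
same = encode_entities(teammate_layer, shuffled) == pooled
```

Because the pooled vector has a fixed length regardless of how many entities are passed in, a design along these lines would let the same policy be evaluated with different team sizes, which is consistent with the scalability the summary describes.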

Paper Structure (17 sections, 2 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: An illustration of our MARL environment (A) and various top-down views of a 5v5 game (B-D). (B) The trajectory visualizer depicts the general game dynamics. (C) The corresponding default top-down view. (D) The corresponding ball position critic value heat map.
  • Figure 2: An overview of our system. (A) The ego perspective observation from the opponents (red), teammates (blue), and local observation. (B) The soccer environment. (C) Various curricula we adopted during training. (D) The architecture of our policy network. (D1) Encoders with shared weights. (D2) Policy network. (E) The distribution we used for action sampling. (F) The action model of our soccer agent.
  • Figure 3: Initial position curriculum dependent on the current policy performance. The ball's initial distribution is shifted toward the blue side at lower levels to improve the trainees' chances of gaining ball possession. The agents' initial distribution, on the other hand, is kept constant.
  • Figure 4: An overview of our evaluation results conducted for a 3v3 game for three different scenarios (Offensive, Equal, Defensive) against three different adversaries (RL, Bot, HELIOS). The collected average statistics (game outcome (%), team ball ownership (%), the number of successful passes and ball ownership losses, and game duration) are depicted in different rows. Our trainee policy (Blue) achieved clear dominance against all adversaries (besides itself) in all scenarios. The trainees won 66.8% (averaged over all three scenarios) of all games against HELIOS.
  • Figure 5: An illustration of the critic values from a 2v2 (RL vs RL) game as a heat map (Res. 80 $\times$ 80). The plots are acquired by varying the base positions of the trainees (A1, A2) and the ball position (B1, B2) over the whole field while keeping the other observations fixed. The positions of the trainees (blue), adversaries (red), and ball (white) are represented by their respective colored dots, and the large black circle indicates which trainee the heat map belongs to. Furthermore, C provides the corresponding default top-down view overlaid with motion trajectories to provide additional information about the current game dynamics. The blue areas on these heat maps indicate where the agents want themselves and the ball to be.
  • ...and 2 more figures
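
The critic heat maps described in the Figure 5 caption are obtained by sweeping one position over the field while holding every other observation fixed. The following is a minimal sketch of that sweep under stated assumptions: the field dimensions, the dictionary-based observation layout, and `toy_critic` are hypothetical placeholders for illustration, and only the 80 by 80 resolution comes from the caption. A trained critic network would take the place of `toy_critic`.

```python
import math

FIELD_SIZE = (9.0, 6.0)           # hypothetical field length and width in metres
GOAL = (FIELD_SIZE[0] / 2, 0.0)   # hypothetical centre of the opponent goal


def toy_critic(obs):
    """Hypothetical stand-in critic: value grows as the ball nears GOAL."""
    ball_x, ball_y = obs["ball"]
    return -math.hypot(GOAL[0] - ball_x, GOAL[1] - ball_y)


def critic_heat_map(critic, observation, entity, field_size=FIELD_SIZE, resolution=80):
    """Evaluate the critic with `entity` moved to each grid cell centre.

    All other entries of `observation` stay fixed, and the caller's
    observation is never modified.
    """
    length, width = field_size
    heat_map = []
    for row in range(resolution):
        y = -width / 2 + (row + 0.5) * width / resolution
        values = []
        for col in range(resolution):
            x = -length / 2 + (col + 0.5) * length / resolution
            probe = dict(observation)   # shallow copy keeps the original intact
            probe[entity] = (x, y)
            values.append(critic(probe))
        heat_map.append(values)
    return heat_map


observation = {
    "ball": (0.0, 0.0),
    "trainee_0": (-1.0, 0.5),
    "adversary_0": (2.0, -0.5),
}
heat_map = critic_heat_map(toy_critic, observation, "ball")

# Locate the grid cell the stand-in critic rates highest for the ball.
best_row, best_col = max(
    ((r, c) for r in range(80) for c in range(80)),
    key=lambda rc: heat_map[rc[0]][rc[1]],
)
```

Passing `"trainee_0"` instead of `"ball"` as the swept entity would give the agent-position variant of the map; with the stand-in critic above that map is flat, since `toy_critic` only looks at the ball.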