Towards Better Sample Efficiency in Multi-Agent Reinforcement Learning via Exploration

Amir Baghi, Jens Sjölund, Joakim Bergdahl, Linus Gisslén, Alessandro Sestini

TL;DR

The paper tackles the challenge of poor sample efficiency in multi-agent reinforcement learning by focusing on TiZero in a football setting. It introduces architectural improvements for computational efficiency and two exploration-driven augmentations: a self-supervised intrinsic reward (SSIR) and a random network distillation (RND) bonus, with RND showing an 18.8% gain in sample efficiency under a fixed data budget. Through curriculum self-play experiments and qualitative gameplay evaluations against a heuristic baseline, the study finds that RND promotes a more offensive and confident playstyle, while SSIR can bias strategies toward possession. The findings suggest that well-designed exploration bonuses, particularly RND, can make advanced multi-agent policies more tractable in complex team-based environments and likely generalize beyond football scenarios.

Abstract

Multi-agent reinforcement learning has shown promise in learning cooperative behaviors in team-based environments. However, such methods often demand extensive training time. For instance, the state-of-the-art method TiZero takes 40 days to train high-quality policies for a football environment. In this paper, we hypothesize that better exploration mechanisms can improve the sample efficiency of multi-agent methods. We propose two different approaches for better exploration in TiZero: a self-supervised intrinsic reward and a random network distillation bonus. Additionally, we introduce architectural modifications to the original algorithm to enhance TiZero's computational efficiency. We evaluate the sample efficiency of these approaches through extensive experiments. Our results show that random network distillation improves training sample efficiency by 18.8% compared to the original TiZero. Furthermore, we evaluate the qualitative behavior of the models produced by both variants against a heuristic AI, with the self-supervised reward encouraging possession and random network distillation leading to a more offensive performance. Our results highlight the applicability of our random network distillation variant in practical settings. Lastly, due to the nature of the proposed method, we note its potential use beyond football simulation, especially in environments with strong multi-agent and strategic aspects.
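The random network distillation bonus mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard RND formulation: a fixed, randomly initialized target network, a predictor trained to match it, and the prediction error used as the intrinsic reward. The linear maps, the names `LinearNet` and `RNDBonus`, the learning rate, and the embedding size are illustrative assumptions, not the authors' implementation. How the paper scales the bonus or combines it with TiZero's reward is not stated here.

```python
import random


class LinearNet:
    """Tiny linear map obs -> embedding, standing in for a neural network (assumption)."""

    def __init__(self, in_dim, out_dim, rng):
        self.w = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def forward(self, obs):
        return [sum(wi * xi for wi, xi in zip(row, obs)) for row in self.w]


class RNDBonus:
    """Hypothetical helper: intrinsic reward = predictor's error against a fixed random target."""

    def __init__(self, obs_dim, embed_dim=4, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.target = LinearNet(obs_dim, embed_dim, rng)     # fixed, never trained
        self.predictor = LinearNet(obs_dim, embed_dim, rng)  # trained to imitate the target
        self.lr = lr

    def bonus(self, obs):
        """Mean squared prediction error; large for observations rarely seen in training."""
        t = self.target.forward(obs)
        p = self.predictor.forward(obs)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

    def update(self, obs):
        """One gradient-descent step on the mean squared error for a single observation."""
        t = self.target.forward(obs)
        p = self.predictor.forward(obs)
        n = len(t)
        for k, row in enumerate(self.predictor.w):
            grad_scale = 2.0 * (p[k] - t[k]) / n
            for j in range(len(row)):
                row[j] -= self.lr * grad_scale * obs[j]


rnd = RNDBonus(obs_dim=3)
familiar = [1.0, 0.0, 0.5]  # observation visited many times
novel = [0.0, 1.0, 0.0]     # observation never trained on

bonus_before = rnd.bonus(familiar)
for _ in range(200):
    rnd.update(familiar)
bonus_after = rnd.bonus(familiar)
bonus_novel = rnd.bonus(novel)
```

In this toy setup the bonus for the repeatedly visited observation shrinks as the predictor catches up with the target, while the unvisited observation keeps its initial bonus. That gap is the signal that pushes an agent toward unfamiliar states.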

Paper Structure

This paper contains 17 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A screenshot of an 11-vs-11 scenario in the Google Research Football environment. We use this environment as a testbed for our training experiments and evaluations.
  • Figure 2: An overview of the architecture of our modified TiZero actor component, with our modifications highlighted in blue. We replace the original in TiZero with a 4-layer and the original player-ID encoder with fixed positional encodings. The network is shared among agents; because each agent's player-related information is passed in alongside the global information, the network can differentiate among individual agents.
  • Figure 3: Comparison of win-rate and reward achieved by the best-performing TiZero-RND (green) and the best-performing standard TiZero (blue) experiments during training. The solid points represent the exponential moving average, and the shaded points are the raw values.
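
The smoothing described in the Figure 3 caption can be sketched as follows. This is a minimal illustration of an exponential moving average over a training curve. The function name `exponential_moving_average`, the smoothing factor `alpha`, and the sample win-rate values are assumptions, since the caption does not give the factor the authors used.

```python
def exponential_moving_average(values, alpha=0.1):
    """Smooth a sequence: each output blends the new value with the previous average.

    alpha is a hypothetical smoothing factor in (0, 1]; larger values track the
    raw curve more closely, smaller values smooth more aggressively.
    """
    smoothed = []
    current = None
    for v in values:
        # The first point has no history, so it seeds the running average.
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Made-up raw win rates, for illustration only (not data from the paper).
raw_win_rate = [0.0, 1.0, 1.0]
smooth_win_rate = exponential_moving_average(raw_win_rate, alpha=0.5)
```

The raw values correspond to the shaded points of such a plot, and the smoothed values to the solid ones.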