Laser Learning Environment: A new environment for coordination-critical multi-agent tasks

Yannick Molinghen; Raphaël Avalos; Mark Van Achter; Ann Nowé; Tom Lenaerts

TL;DR

The Laser Learning Environment (LLE) is a cooperative multi-agent grid world characterized by perfect coordination, interdependence, and zero-incentive dynamics, which together create state-space bottlenecks that impede exploration. The authors benchmark leading value-based MARL methods (IQL, VDN, QMIX) and demonstrate that, although agents learn to achieve perfect coordination, they fail at the collaborative task because they cannot escape these bottlenecks. They further analyze three learning augmentations (Prioritized Experience Replay, n-step returns, and Random Network Distillation), finding that PER and n-step returns hinder exploration under zero-incentive dynamics, while RND offers limited benefits. The work concludes that current value-based MARL methods are ill-suited for LLE and positions LLE as a valuable benchmark for developing new cooperative MARL techniques, with implications for generalization, curriculum learning, and inter-agent communication. The codebase is publicly available, enabling reproducibility and broader experimentation in cooperative MARL research.

Abstract

We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and accomplishing those joint actions does not yield any intermediate reward (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritized experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.
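
As a point of reference for the n-step returns mentioned in the abstract, the standard n-step bootstrapped target can be sketched as follows. This is a minimal illustration of the textbook formula, not code from the LLE codebase; the function name `n_step_target` and the numeric values are assumptions chosen for the example. It shows that when every reward inside the n-step window is zero, as in zero-incentive dynamics, the target reduces to the discounted bootstrap value alone.

```python
def n_step_target(rewards, bootstrap_value, gamma, n):
    """Standard n-step target: sum_{k<n} gamma^k * r_k + gamma^n * bootstrap.

    `rewards` holds the rewards observed over the next n steps and
    `bootstrap_value` stands for max_a Q(s_{t+n}, a). Both are plain floats
    here (hypothetical inputs, no learning library involved).
    """
    if len(rewards) < n:
        raise ValueError("need at least n rewards")
    discounted = sum(gamma ** k * rewards[k] for k in range(n))
    return discounted + gamma ** n * bootstrap_value


# Window containing only unrewarded steps: the target is just the
# discounted bootstrap value.
zero_incentive = n_step_target([0.0, 0.0, 0.0], bootstrap_value=2.0, gamma=0.5, n=3)

# Same window, but with a reward on the first step (e.g. collecting a gem).
rewarded = n_step_target([1.0, 0.0, 0.0], bootstrap_value=2.0, gamma=0.5, n=3)

print(zero_incentive, rewarded)
```

With these illustrative inputs the two targets differ only by the discounted reward term, which is absent whenever the steps in the window are unrewarded.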

Paper Structure (43 sections, 8 figures, 3 tables)

This paper contains 43 sections, 8 figures, 3 tables.

Introduction
Contributions
Background
Multi-agent Markov Decision Process
Q-value factorisation
Cooperative multi-agent environments
The StarCraft Multi-Agent Challenge
Overcooked
Hanabi Learning Environment
The Multi-agent Particle Environment
The Laser Learning Environment
Motivations
Perfect coordination
Interdependence
Zero-incentive dynamics
...and 28 more sections

Figures (8)

Figure 1: Level 6 of LLE, which has 4 agents, 3 lasers and 4 gems. Agent red blocks the red laser, making it possible for the other agents to pass to the lower part of the grid world. Additional blocking of the yellow laser is required for them to all pass and reach the exit tiles.
Figure 2: Representation of the state shown in Figure 1. The layers "Agent 2", "Agent 3", "Laser 2" and "Laser 3" were omitted for the sake of conciseness. Each layer encodes the location of a specific type of object of the grid world (walls, agents' locations, …). White squares represent $0$s, black squares are $1$s and grey squares are $-1$s.
Figure 3: Training score and exit rate over time for IQL, VDN and QMIX on level 6 (Figure 1). The maximal achievable score is 9. Results are averaged over 20 different seeds and shown with 95% confidence intervals, capped by the minimum and maximum.
Figure 4: Training score and exit rate over training time for VDN, VDN with PER, VDN with RND and VDN with 3-step return on level 6. The maximal score that agents can reach in an episode of level 6 is $9$. Results are averaged over 20 different seeds and shown with 95% confidence intervals.
Figure 5: Four consecutive states of an episode. Agent red blocks the laser for agent yellow and waits for the latter to have left the range of the blocked beam.
...and 3 more figures
