Laser Learning Environment: A new environment for coordination-critical multi-agent tasks
Yannick Molinghen, Raphaël Avalos, Mark Van Achter, Ann Nowé, Tom Lenaerts
TL;DR
The Laser Learning Environment (LLE) is a cooperative multi-agent grid world built around three properties: perfect coordination, interdependence, and zero-incentive dynamics. Together these create state-space bottlenecks that impede exploration. The authors benchmark value-based MARL methods (IQL, VDN, QMIX) and show that, although agents learn to achieve perfect coordination, they fail to complete the task because they cannot escape those bottlenecks. They further analyze three learning augmentations (Prioritized Experience Replay, n-step returns, and Random Network Distillation) and find that PER and n-step returns hinder exploration under zero-incentive dynamics, while RND is not sufficient to escape the bottlenecks. The work concludes that current value-based MARL methods are ill-suited for LLE and positions LLE as a benchmark for developing new cooperative MARL techniques, with implications for generalization, curriculum learning, and inter-agent communication. The codebase is publicly available, enabling reproducibility and broader experimentation in cooperative MARL research.
Abstract
We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and receive no intermediate reward for accomplishing those joint actions (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state-space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state-space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritized experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.
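As a minimal illustration of how n-step returns interact with zero-incentive dynamics, the sketch below computes the standard n-step bootstrapped target. It is not taken from the LLE codebase: the function name `n_step_target` and the numeric values are hypothetical and chosen only for the example. When every intermediate reward is zero, the target reduces to the discounted bootstrap value alone.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Standard n-step target: sum_k gamma^k * r_k + gamma^n * bootstrap_value.

    `rewards` holds the n rewards observed along the trajectory segment and
    `bootstrap_value` is the value estimate of the state reached after n steps.
    (Hypothetical helper for illustration only.)
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# Zero-incentive segment: no reward for any of the three steps.
unrewarded = n_step_target([0.0, 0.0, 0.0], bootstrap_value=2.0, gamma=0.5)

# Same segment, but with a reward on the last step.
rewarded = n_step_target([0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)

print(unrewarded)  # 0.25, i.e. gamma**3 * bootstrap_value
print(rewarded)    # 0.5
```

In the unrewarded case the target contains no reward term at all, so it depends entirely on the accuracy of the bootstrapped estimate.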
