Adaptive Episode Length Adjustment for Multi-agent Reinforcement Learning
Byunghyun Yoo, Younghwan Shin, Hyunwoo Kim, Euisok Chung, Jeongmin Yang
TL;DR
This work addresses how fixed episode lengths hinder learning in multi-agent reinforcement learning (MARL) by proposing Adaptive Episode Length Adjustment (AELA), which begins with short episodes and incrementally increases the horizon based on learning progress. The approach is grounded in a theoretical link between shorter horizons, secure exploration, and reduced time in dead-end states, and it uses the entropy of value estimates as the signal for when to lengthen episodes. Empirically, AELA improves convergence speed and final performance on SMAC and Modified Predator-Prey for both VDN- and QMIX-based MARL methods. The findings offer a practical, model-agnostic strategy for improving MARL performance in complex environments where dead-ends and long-horizon planning pose challenges.
Abstract
In standard reinforcement learning, an episode is a sequence of interactions between agents and the environment that terminates upon reaching a terminal state or a pre-defined episode length. Setting a shorter episode length yields more episodes from the same number of data samples, which facilitates exploration of diverse states. Although shorter episodes limit the collection of long-term interactions, they can offer significant advantages when properly managed. For example, trajectory truncation in single-agent reinforcement learning has shown that the benefits of shorter episodes can be leveraged despite the reduced long-term interaction experience. However, this approach remains underexplored in multi-agent reinforcement learning (MARL). This paper proposes a novel MARL approach, Adaptive Episode Length Adjustment (AELA), in which the episode length is initially limited and then gradually increased based on an entropy-based assessment of learning progress. By starting with shorter episodes, agents can focus on learning effective strategies for initial states and minimize time spent in dead-end states. Using entropy as the assessment metric prevents premature convergence to suboptimal policies and ensures balanced training over varying episode lengths. We validate our approach on the StarCraft Multi-agent Challenge (SMAC) and a modified predator-prey environment, demonstrating significant improvements in both convergence speed and overall performance compared to existing methods. To the best of our knowledge, this is the first study to adaptively adjust episode length in MARL based on learning progress.
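The idea of starting with a short episode limit and extending it according to an entropy signal can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the entropy is computed or what rule triggers an extension. The sketch assumes the entropy is taken over a softmax of action-value estimates and that the limit grows by a fixed step whenever the mean entropy falls below a threshold. The names `softmax_entropy`, `EpisodeLengthScheduler`, `step`, and `entropy_threshold` are hypothetical.

```python
import math


def softmax_entropy(q_values, temperature=1.0):
    """Shannon entropy (in nats) of a softmax over action-value estimates.

    Assumption: this is one plausible reading of an "entropy-based
    assessment"; the paper's exact definition is not given in the abstract.
    """
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EpisodeLengthScheduler:
    """Hypothetical scheduler that grows the episode limit over training.

    Assumption: the limit is extended by `step` whenever the mean entropy
    drops below `entropy_threshold`, up to `max_limit`.
    """

    def __init__(self, initial_limit, max_limit, step, entropy_threshold):
        self.limit = initial_limit
        self.max_limit = max_limit
        self.step = step
        self.entropy_threshold = entropy_threshold

    def update(self, mean_entropy):
        """Return the episode limit to use for the next batch of episodes."""
        if mean_entropy < self.entropy_threshold and self.limit < self.max_limit:
            self.limit = min(self.limit + self.step, self.max_limit)
        return self.limit


if __name__ == "__main__":
    scheduler = EpisodeLengthScheduler(
        initial_limit=10, max_limit=20, step=5, entropy_threshold=0.5
    )
    # Illustrative value estimates only: near-uniform, then increasingly peaked.
    for q in ([0.0, 0.1, 0.0], [0.0, 5.0, 0.0], [0.0, 9.0, 0.0]):
        print(scheduler.update(softmax_entropy(q)))
```

In a training loop, `update` would be called periodically with an entropy averaged over agents and sampled states, and rollouts would be truncated at the returned limit. A threshold rule is only one option; a criterion based on the change in entropy over time would fit the same interface.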
