Learning Progress Driven Multi-Agent Curriculum

Wenshuai Zhao, Zhiyuan Li, Joni Pajarinen

TL;DR

This paper tackles the challenge of curriculum design in multi-agent reinforcement learning (MARL) with sparse rewards, focusing on how the number of agents influences exploration and credit assignment. It moves beyond reward-based curricula by introducing a learning-progress driven approach that uses TD-error based learning progress to guide context distributions, transitioning from easier to target tasks. The authors present SPRLM as an MARL extension of self-paced curriculum learning and then introduce Self-Paced MARL (SPMARL), which estimates learning progress from the critic and reduces variance in curriculum estimation. Through experiments on MPE Simple-Spread, XOR, and SMAC-v2 Protoss tasks, SPMARL consistently outperforms baselines, demonstrating faster learning and more stable curricula with practical impact for scalable MARL in sparse-reward settings.

Abstract

The number of agents can be an effective curriculum variable for controlling the difficulty of multi-agent reinforcement learning (MARL) tasks. Existing work typically uses manually defined curricula such as linear schemes. We identify two potential flaws when applying existing reward-based automatic curriculum learning methods in MARL: (1) the expected episode return used to measure task difficulty has high variance; (2) credit assignment difficulty can be exacerbated in tasks where increasing the number of agents yields higher returns, which is common in many MARL tasks. To address these issues, we propose to control the curriculum by using a TD-error based *learning progress* measure and by letting the curriculum proceed from an initial context distribution to the final task-specific one. Since our approach maintains a distribution over the number of agents and measures learning progress rather than absolute performance, which often increases with the number of agents, we alleviate problem (2). Moreover, the learning progress measure naturally alleviates problem (1) by aggregating returns. In three challenging sparse-reward MARL benchmarks, our approach outperforms state-of-the-art baselines.
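The idea of steering a distribution over agent counts with a TD-error signal can be illustrated with a toy sketch. This is not the paper's SPMARL update: the function names, the mean-absolute-TD-error proxy for learning progress, the exponential reweighting, and the `mix` and `temperature` parameters are all assumptions made for illustration only.

```python
import math


def td_errors(rewards, values, gamma=0.99):
    """One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must hold len(rewards) + 1 entries (the last one is the
    bootstrap value of the final state).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def learning_progress(rewards, values, gamma=0.99):
    """Assumed proxy for learning progress: mean absolute TD error."""
    deltas = td_errors(rewards, values, gamma)
    return sum(abs(d) for d in deltas) / len(deltas)


def update_context_distribution(probs, progress, target, mix, temperature=1.0):
    """Illustrative curriculum step over contexts (e.g. agent counts).

    1. Tilt the current distribution toward contexts with higher progress.
    2. Blend the result with the target distribution; mix=0 ignores the
       target, mix=1 returns the target itself.
    """
    weights = [p * math.exp(lp / temperature) for p, lp in zip(probs, progress)]
    total = sum(weights)
    tilted = [w / total for w in weights]
    return [(1.0 - mix) * q + mix * t for q, t in zip(tilted, target)]
```

In this sketch, raising `mix` over the course of training moves the sampling distribution from the initial one toward the target task, while the progress term decides which agent counts get extra samples along the way.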

Paper Structure

This paper contains 25 sections, 9 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: In the Simple-Spread task, agents (blue circles) need to cover as many landmarks (red circles) as possible. With the number of landmarks fixed, the $20$ agents shown on the right can easily complete the task and achieve higher returns than the $8$ agents on the left. However, a higher number of agents exacerbates the credit assignment problem in policy learning.
  • Figure 2: SPRL involves a two-stage optimization.
  • Figure 3: (a) The payoff matrix of the 2-player XOR game. (b) Scenario from Protoss 5 vs. 5 in SMACv2 showing agents battling the built-in AI.
  • Figure 4: Comparison on the Simple-Spread task, where the target is set with $8$ agents and $8$ landmarks. The plots are averaged over $5$ random seeds and the shaded area denotes the $95\%$ confidence intervals. The left figure shows the evaluation returns on the target task with $8$ agents. Note that the x-axis represents the number of samples collected from the environment, which is proportional to the number of agents. The middle figure presents the curricula generated by the different methods, where SPMARL and SPRLM first generate more agents and then converge to the target $8$ agents, while ALPGMM and VACL always generate more agents. The right figure shows the episode returns on the training tasks. The ALPGMM algorithm achieves the highest returns because it samples tasks with more than $14$ agents.
  • Figure 5: Comparison on the 20-player XOR game, where each agent needs to output different actions to succeed. While the linear curriculum from fewer to more agents (linear) and ALPGMM eventually reach the optimum, SPRLM and SPMARL converge faster.
  • ...and 6 more figures