
CURO: Curriculum Learning for Relative Overgeneralization

Lin Shi, Qiyuan Liu, Bei Peng

TL;DR

CURO (curriculum learning for relative overgeneralization) overcomes severe RO, improves performance, and outperforms baseline methods on a variety of challenging cooperative multi-agent tasks.

Abstract

Relative overgeneralization (RO) is a pathology that can arise in cooperative multi-agent tasks when the optimal joint action's utility falls below that of a sub-optimal joint action. RO can cause the agents to get stuck in local optima or fail to solve cooperative tasks that require significant coordination between agents within a given timestep. In this work, we empirically find that both value-based and policy gradient multi-agent reinforcement learning (MARL) algorithms can suffer from RO and fail to learn effective coordination policies. To better overcome RO, we propose a novel approach called curriculum learning for relative overgeneralization (CURO). To solve a target task that exhibits strong RO, CURO first fine-tunes the reward function of the target task to generate source tasks on which to train the agent. Then, to effectively transfer the knowledge acquired in one task to the next, we use a transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. CURO is general and can be applied to both value-based and policy gradient MARL methods. We demonstrate that, when applied to QMIX, HAPPO, and HATRPO, CURO can successfully overcome severe RO, achieve improved performance, and outperform baseline methods in a variety of challenging cooperative multi-agent tasks.
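
To make the definition of RO concrete, here is a minimal sketch on a toy two-agent matrix game. The game, the function names, and the penalty values are hypothetical and are not taken from the paper. The penalty `p` is only loosely modeled on the negative reward scaling mentioned in the figure captions. An action's utility is taken to be its average payoff against a uniformly random partner, which is a simplifying assumption. The sketch shows how shrinking the penalty yields easier source tasks in which RO disappears. It does not implement value function transfer or buffer transfer.

```python
import numpy as np


def payoff_matrix(p: float) -> np.ndarray:
    """Hypothetical shared payoff for a two-agent, three-action cooperative game.

    Joint action (0, 0) is optimal, but choosing action 0 while the partner
    chooses anything else is punished with -p.
    """
    return np.array([
        [10.0, -p, -p],
        [-p, 4.0, 4.0],
        [-p, 4.0, 4.0],
    ])


def action_utilities(payoff: np.ndarray) -> np.ndarray:
    """Average payoff of each row-agent action against a uniformly random partner."""
    return payoff.mean(axis=1)


def exhibits_ro(payoff: np.ndarray) -> bool:
    """True if the row action of the optimal joint action does not have the highest utility."""
    optimal_row = np.unravel_index(np.argmax(payoff), payoff.shape)[0]
    return int(np.argmax(action_utilities(payoff))) != int(optimal_row)


# Illustrative curriculum: source tasks with a small penalty first, target task last.
for p in [0.0, 1.0, 5.0]:
    game = payoff_matrix(p)
    print(f"p={p}: utilities={action_utilities(game)}, RO={exhibits_ro(game)}")
```

In this toy game the averaged utility of the optimal action drops below that of the safe actions once the penalty is large enough, so an agent that estimates utilities independently of its partner would prefer the sub-optimal joint action.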

Paper Structure (17 sections, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: The partially-observable predator-prey task, where two agents are rewarded when they execute the catch action simultaneously and punished when only one of them attempts it (Son et al., 2019).
  • Figure 2: Mean test return for QMIX, QPLEX, WQMIX, and CURO-QMIX on predator-prey with different levels of difficulty. The 95% confidence interval is shown shaded. For CURO-QMIX, the learning curve is offset to reflect timesteps in source tasks.
  • Figure 3: Median test win rates for QMIX, QPLEX, WQMIX, and CURO-QMIX on four different SMAC maps with negative reward scaling $p=1.0$. The $0$ to $100\%$ percentile range is shown shaded. For CURO-QMIX, the learning curve is offset to reflect timesteps in source tasks.
  • Figure 4: Mean episode reward for HAPPO, HATRPO, CURO-HAPPO and CURO-HATRPO on four MaMuJoCo environments with negative reward scaling $p$. The 95% confidence interval is shown shaded. For CURO-HAPPO and CURO-HATRPO, the learning curves are offset to reflect timesteps in source tasks.
  • Figure 5: Mean test return for CURO-QMIX with and without buffer transfer on two predator-prey tasks with different levels of difficulty. The 95% confidence interval is shown shaded. We show only the timesteps in the target task.
  • ...and 2 more figures