MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure

Zhicheng Zhang, Yancheng Liang, Yi Wu, Fei Fang

TL;DR

MESA tackles the exploration bottleneck in cooperative multi-agent reinforcement learning, where sparse rewards and large joint action spaces hinder convergence to Pareto-optimal equilibria. It introduces a meta-exploration framework that first identifies a high-rewarding joint state-action subspace from a batch of training tasks and then learns a diverse set of exploration policies to cover this subspace, which can be plugged into any off-policy MARL algorithm at test time. The method combines subspace discovery with iterative policy coverage and uses pseudo-count-based rewards to promote broad, non-redundant exploration. Empirical results across the Climb Game, multi-agent MPE, and multi-agent MuJoCo demonstrate that MESA surpasses competitive baselines and generalizes to harder, unseen tasks, indicating strong practical potential for scalable cooperative MARL.

Abstract

Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to a Pareto-optimal Nash equilibrium, owing largely to the lack of efficient exploration. The problem is exacerbated in sparse-reward settings by the larger variance exhibited in policy learning. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" the subspace. These trained exploration policies can be integrated with any off-policy MARL algorithm for test-time tasks. We first showcase MESA's advantage in a multi-step matrix game. Furthermore, experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments, and exhibits the ability to generalize to more challenging tasks at test time.
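
To make the two-stage idea concrete (identify a high-rewarding joint state-action subspace, then reward exploration policies for covering it), here is a minimal sketch in Python. It is not the paper's implementation: the helper names `high_reward_subspace` and `PseudoCountBonus`, the reward threshold, and the bonus form $1/\sqrt{N+1}$ are all assumptions chosen for illustration, since this page does not give the exact reward definition.

```python
import math
from collections import Counter


def high_reward_subspace(transitions, threshold):
    """Collect the joint (state, joint_action) pairs whose reward meets the threshold.

    `transitions` is an iterable of (state, joint_action, reward) tuples.
    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    return {(s, a) for (s, a, r) in transitions if r >= threshold}


class PseudoCountBonus:
    """Count-based bonus 1/sqrt(N + 1), paid only inside the target subspace.

    The bonus shrinks as a pair is revisited, so a policy earns more by
    covering pairs that have been visited less often (assumed bonus form).
    """

    def __init__(self, subspace):
        self.subspace = subspace
        self.counts = Counter()

    def __call__(self, state, joint_action):
        key = (state, joint_action)
        if key not in self.subspace:
            return 0.0  # no bonus outside the high-rewarding subspace
        bonus = 1.0 / math.sqrt(self.counts[key] + 1)
        self.counts[key] += 1
        return bonus


# Toy data: one state, two agents, three joint actions with different rewards.
transitions = [
    (0, (0, 0), 1.0),
    (0, (1, 1), 0.5),
    (0, (0, 1), 0.0),
]
subspace = high_reward_subspace(transitions, threshold=0.5)
bonus = PseudoCountBonus(subspace)

first = bonus(0, (0, 0))    # first visit to a pair in the subspace
second = bonus(0, (0, 0))   # repeat visit earns a smaller bonus
outside = bonus(0, (0, 1))  # pair outside the subspace earns nothing
```

In this toy run the repeat visit earns less than the first one, which is the property that discourages redundant exploration; any decreasing function of the visit count would serve the same purpose.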

Paper Structure

This paper contains 35 sections, 6 theorems, 16 equations, 11 figures, 2 tables, 2 algorithms.

Key Result

Theorem 4.2

Assume $\delta\le \frac{1}{6}, U \ge 3$. Using a uniform exploration policy in the climb game $G_f(2,0,U)$, it can be proved that $q_{\mathcal{J}^{(T)}}(\mathbf{W}, \mathbf{b}, \mathbf{c}, d)$ will become equivalently optimal only after $T=\Omega(|\mathcal{A}|\delta^{-1})$ steps. When $\delta=1$, $T$ ...
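
A small enumeration can give intuition for why uniform exploration scales poorly with the size of the action space. The sketch below is not taken from the paper. It assumes a climb-game payoff in which the joint action earns $1$ when every agent plays a designated action $u$, earns $1-\delta$ when no agent plays $u$, and earns $0$ otherwise, and it reads $G_f(2,0,U)$ as two agents, $u=0$, and $U$ actions per agent. The function names are hypothetical. It illustrates only the dependence on the number of joint actions, not the $\delta^{-1}$ factor in the bound.

```python
import itertools


def climb_reward(joint_action, u=0, delta=0.5):
    """Assumed climb-game payoff (see the assumptions stated in the lead-in)."""
    hits = sum(a == u for a in joint_action)
    if hits == len(joint_action):
        return 1.0          # every agent coordinates on u: optimal
    if hits == 0:
        return 1.0 - delta  # nobody plays u: safe but suboptimal
    return 0.0              # partial coordination is penalized


def uniform_exploration_stats(n_agents=2, num_actions=3, u=0):
    """Enumerate all joint actions and count how often uniform sampling hits each outcome."""
    joint_actions = list(itertools.product(range(num_actions), repeat=n_agents))
    total = len(joint_actions)
    n_optimal = sum(all(a == u for a in ja) for ja in joint_actions)
    n_suboptimal = sum(all(a != u for a in ja) for ja in joint_actions)
    return {
        "joint_actions": total,
        "p_optimal": n_optimal / total,
        "p_suboptimal": n_suboptimal / total,
        # Mean of a geometric distribution with success probability p_optimal.
        "expected_steps_to_optimal": total / n_optimal,
    }


small = uniform_exploration_stats(n_agents=2, num_actions=3)
large = uniform_exploration_stats(n_agents=2, num_actions=10)
```

Under these assumptions, with 3 actions per agent the optimal joint action is 1 of 9 cells, while with 10 actions it is 1 of 100, and the suboptimal outcome is sampled far more often (81 of 100 cells). Uniform sampling therefore sees mostly suboptimal or zero rewards, which matches the motivation for structured exploration given in the abstract.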

Figures (11)

  • Figure 1: Illustration of structured and unstructured exploration behavior in the $2$-player climb game. The rows and columns indicate the players' action spaces. While unstructured exploration aims to visit novel states, structured exploration exploits structure in the joint state-action space, helping agents explore the potential high-reward subspace in a coordinated and more efficient way.
  • Figure 2: MESA's meta-learning framework. In the meta-training stage, MESA learns exploration policies to cover the high-rewarding subspace. In the meta-testing stage, MESA uses the learned exploration policies to assist the learning in an unseen task. Each color corresponds to a different task, and the colored points represent the high-rewarding joint state-action pairs collected in that task.
  • Figure 3: Learning curves of the two climb game variants w.r.t. the number of environment steps. The return is averaged over timesteps for the multi-stage games. The dotted lines indicate the suboptimal return of $0.5$ (purple) and the optimal return $1$ (blue) for each agent.
  • Figure 4: Learning curves of MESA and the compared baselines w.r.t. the number of environment interactions during the meta-testing stage in the MPE domain and the multi-agent MuJoCo environment Swimmer. The two dotted lines indicate the ideal optimal (purple) and sub-optimal (blue) return summed over timesteps. A return above the blue line would typically indicate that the agents are able to learn the optimal strategy.
  • Figure 5: Visualizations of a $2$-player $3$-landmark MPE climb game.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Definition 4.1
  • Theorem 4.2: uniform exploration
  • Theorem 4.3: $\epsilon$-greedy exploration
  • Theorem 4.4: structured exploration
  • Theorem 5.1: Exploration during Meta-Testing
  • Lemma A.1
  • Definition A.2: $\epsilon$ Generalization
  • Lemma A.3