MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure

Zhicheng Zhang; Yancheng Liang; Yi Wu; Fei Fang

TL;DR

MESA tackles the exploration bottleneck in cooperative multi-agent reinforcement learning, where sparse rewards and large joint action spaces hinder convergence to Pareto-optimal equilibria. It introduces a meta-exploration framework that first identifies a high-rewarding joint state-action subspace from a batch of training tasks and then learns a diverse set of exploration policies to cover this subspace, which can be plugged into any off-policy MARL algorithm at test time. The method combines subspace discovery with iterative policy coverage and uses pseudo-count-based rewards to promote broad, non-redundant exploration. Empirical results across the Climb Game, multi-agent MPE, and multi-agent MuJoCo demonstrate that MESA surpasses competitive baselines and generalizes to harder, unseen tasks, indicating strong practical potential for scalable cooperative MARL.
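The pseudo-count idea mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the bonus form `scale / sqrt(N + 1)` is a common count-based choice assumed here for illustration, and `pseudo_count_bonus` and `cover_subspace` are hypothetical names. The point is only that a reward which shrinks with visit counts pushes exploration toward the parts of the high-rewarding subspace that have been covered least.

```python
from collections import Counter
import math


def pseudo_count_bonus(counts, key, scale=1.0):
    """Exploration bonus that decays as a pair is visited more often.

    Assumption: the 1/sqrt(N + 1) form is a common count-based choice,
    not necessarily the exact reward used by MESA.
    """
    return scale / math.sqrt(counts[key] + 1)


def cover_subspace(high_reward_pairs, num_steps):
    """Greedily visit the pair with the largest remaining bonus.

    `high_reward_pairs` is a list of hashable (state, joint_action) pairs
    that stand in for the discovered high-rewarding subspace.
    """
    counts = Counter()
    visits = []
    for _ in range(num_steps):
        best = max(high_reward_pairs,
                   key=lambda pair: pseudo_count_bonus(counts, pair))
        counts[best] += 1
        visits.append(best)
    return counts, visits


if __name__ == "__main__":
    pairs = [("s0", (1, 1)), ("s0", (2, 2)), ("s1", (0, 0))]
    counts, visits = cover_subspace(pairs, num_steps=6)
    print(dict(counts))
```

Because the bonus of a pair drops each time it is visited, the greedy loop spreads its visits evenly over the subspace instead of returning to the same pair.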

Abstract

Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to a Pareto-optimal Nash equilibrium, owing largely to the lack of efficient exploration. The problem is exacerbated in sparse-reward settings, where policy learning exhibits larger variance. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents' high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to "cover" the subspace. These trained exploration policies can be integrated with any off-policy MARL algorithm for test-time tasks. We first showcase MESA's advantage in a multi-step matrix game. Furthermore, experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments, and exhibits the ability to generalize to more challenging tasks at test time.
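The two stages described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: the fixed reward `threshold`, the tuple layout of a transition, the `explore_prob` mixing rule, and the function names `collect_high_reward_pairs` and `pick_behavior_policy` are all hypothetical.

```python
import random


def collect_high_reward_pairs(task_transitions, threshold):
    """Gather (state, joint_action) pairs whose reward reaches `threshold`.

    `task_transitions` holds one list of (state, joint_action, reward)
    tuples per training task. Thresholding on reward is an assumed,
    simplified stand-in for the subspace identification step.
    """
    dataset = set()
    for transitions in task_transitions:
        for state, joint_action, reward in transitions:
            if reward >= threshold:
                dataset.add((state, joint_action))
    return dataset


def pick_behavior_policy(rng, exploration_policies, task_policy, explore_prob):
    """Choose which policy collects the next episode at test time.

    With probability `explore_prob` (an assumed hyperparameter) a learned
    exploration policy fills the replay buffer; otherwise the off-policy
    learner's own policy does.
    """
    if exploration_policies and rng.random() < explore_prob:
        return rng.choice(exploration_policies)
    return task_policy


if __name__ == "__main__":
    tasks = [
        [("s0", (0, 0), 0.0), ("s0", (1, 1), 1.0)],
        [("s1", (1, 1), 0.9), ("s1", (0, 1), 0.0)],
    ]
    print(sorted(collect_high_reward_pairs(tasks, threshold=0.5)))
    rng = random.Random(0)
    print(pick_behavior_policy(rng, ["explore_a", "explore_b"], "task", 0.3))
```

Only the behavior policy changes in this sketch. Since the learner is off-policy, it can train on replay data gathered by either kind of policy, which is why the abstract can claim compatibility with any off-policy MARL algorithm.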

Paper Structure (35 sections, 6 theorems, 16 equations, 11 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
A Motivating Example: Climb Game
Exploration Challenge
Structured Exploration
Method
Meta-Training
Identifying High-Rewarding Joint State-Action Subspace
Learning Exploration Policies
Meta-Testing
Implementation Detail of MESA
Experiments
Evaluation Setup
Climb Game Variants
...and 20 more sections

Key Result

Theorem 4.2

Assume $\delta \le \frac{1}{6}$ and $U \ge 3$. Under a uniform exploration policy in the climb game $G_f(2,0,U)$, $q_{\mathcal{J}^{(T)}}(\mathbf{W}, \mathbf{b}, \mathbf{c}, d)$ becomes equivalently optimal only after $T=\Omega(|\mathcal{A}|\delta^{-1})$ steps. When $\delta=1$, $T$ ...
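To give some intuition for why uniform exploration scales with $|\mathcal{A}|$, the sketch below enumerates a small climb game. The payoff rule is an assumption, since Definition 4.1 is not reproduced on this page: the reward is taken to be $1$ when every player picks a designated action, $1-\delta$ when no player picks it, and $0$ otherwise. The names `climb_reward`, `uniform_hit_probability`, and the parameter `special` are hypothetical. The code only computes how often a uniformly random joint action is optimal; it does not reproduce the theorem's bound or its dependence on $\delta$.

```python
from fractions import Fraction
from itertools import product


def climb_reward(joint_action, delta, special=0):
    """Assumed climb-game payoff (Definition 4.1 is not shown here).

    1 if all players pick `special`, 1 - delta if none do, 0 otherwise.
    """
    hits = sum(1 for action in joint_action if action == special)
    if hits == len(joint_action):
        return 1.0
    if hits == 0:
        return 1.0 - delta
    return 0.0


def uniform_hit_probability(num_actions, num_players=2, special=0):
    """Probability that a uniformly random joint action is the optimal one."""
    joint_actions = list(product(range(num_actions), repeat=num_players))
    optimal = [a for a in joint_actions if all(x == special for x in a)]
    return Fraction(len(optimal), len(joint_actions))


if __name__ == "__main__":
    for num_actions in (3, 10, 100):
        print(num_actions, uniform_hit_probability(num_actions))
```

Under this assumed payoff, the chance of sampling the optimal joint action in one step is the reciprocal of the joint action space size, so the expected wait before it is even observed once grows linearly with that size.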

Figures (11)

Figure 1: Illustration of structured and unstructured exploration behavior in the $2$-player climb game. The rows and columns indicate the players' action spaces. While unstructured exploration aims to visit novel states, structured exploration exploits structures in the joint state-action space, helping agents explore the potential high-reward subspace in a coordinated and more efficient way.
Figure 2: MESA's meta-learning framework. In the meta-training stage, MESA learns exploration policies to cover the high-rewarding subspace. In the meta-testing stage, MESA uses the learned exploration policies to assist the learning in an unseen task. Each color corresponds to a different task, and the colored points represent the high-rewarding joint state-action pairs collected in that task.
Figure 3: Learning curves of the two climb game variants w.r.t. the number of environment steps. The return is averaged over timesteps for the multi-stage games. The dotted lines indicate the suboptimal return of $0.5$ (purple) and the optimal return of $1$ (blue) for each agent.
Figure 4: Learning curves of MESA and the compared baselines w.r.t. the number of environment interactions during the meta-testing stage in the MPE domain and the multi-agent MuJoCo environment Swimmer. The two dotted lines indicate the ideal optimal (purple) and sub-optimal (blue) return summed over timesteps. A return above the blue line would typically indicate that the agents are able to learn the optimal strategy.
Figure 5: Visualizations of a $2$-player $3$-landmark MPE climb game.
...and 6 more figures

Theorems & Definitions (8)

Definition 4.1
Theorem 4.2: uniform exploration
Theorem 4.3: $\epsilon$-greedy exploration
Theorem 4.4: structured exploration
Theorem 5.1: Exploration during Meta-Testing
Lemma A.1
Definition A.2: $\epsilon$ Generalization
Lemma A.3
