Towards Principled Unsupervised Multi-Agent Reinforcement Learning

Riccardo Zamboni, Mirco Mutti, Marcello Restelli

TL;DR

This work tackles unsupervised pre-training for multi-agent reinforcement learning by casting state-entropy maximization within convex Markov Games. It delineates three objective families—joint, disjoint, and mixture—and derives entropy-mismatch bounds that reveal mixture entropy as a tractable and effective surrogate, especially in finite-trial regimes. The authors introduce TRPE, a decentralized trust-region algorithm, and demonstrate empirically that mixture-based pre-training yields faster learning and strong zero-shot transfer in sparse-reward downstream tasks, while joint/disjoint objectives can under- or over-optimize. The study provides practical guidance for coordinating exploration in multi-agent settings and points to promising directions for scalable, principled unsupervised MA-RL in more complex domains.

Abstract

In reinforcement learning, we typically refer to unsupervised pre-training when we aim to pre-train a policy without a priori access to the task specification, i.e., rewards, to be later employed for efficient learning of downstream tasks. In single-agent settings, the problem has been extensively studied and is mostly understood. A popular approach, called task-agnostic exploration, casts the unsupervised objective as maximizing the entropy of the state distribution induced by the agent's policy, from which principles and methods follow. In contrast, little is known about it in multi-agent settings, which are ubiquitous in the real world. What are the pros and cons of alternative problem formulations in this setting? How hard is the problem in theory, and how can we solve it in practice? In this paper, we address these questions by first characterizing those alternative formulations and highlighting how the problem, even when tractable in theory, is non-trivial in practice. Then, we present a scalable, decentralized, trust-region policy search algorithm to address the problem in practical settings. Finally, we provide numerical validations to both corroborate the theoretical findings and pave the way for unsupervised multi-agent reinforcement learning via task-agnostic exploration in challenging domains, showing that optimizing for a specific objective, namely mixture entropy, provides an excellent trade-off between tractability and performance.

Paper Structure

This paper contains 13 sections, 9 theorems, 37 equations, 5 figures.

Key Result

Lemma 1

For every cMG $\mathcal{M}_{H}$, for a fixed (joint) policy $\pi = (\pi^{i})_{i \in \mathcal{N}}$ the infinite-trials objectives are ordered according to:

Figures (5)

  • Figure 1: The interaction on the left induces different (empirical) distributions: Marginal distributions for agent 1 and agent 2 over their respective states; a joint distribution over the product space; a mixture distribution over a common space, defined as the average. The mixture distribution is usually less sparse.
  • Figure 2: Single-trial Joint and Mixture Entropy induced by different objective optimization along a $T = 50$ horizon. (Right) State Distributions of two agents induced by different learned policies. We report the average and 95% confidence interval over 4 runs.
  • Figure 3: Effect of pre-training in sparse-reward settings. (Left) Policies initialized with either Uniform or TRPE pre-trained policies. (Right) Policies initialized with either Zero-Mean or TRPE pre-trained policies. We report the average and 95% c.i. over 4 runs over worst-case goals.
  • Figure 4: Full Visualization of Reported Experiments. Experiments with longer horizons highlight how the easier the task, the less crucial the distinction between the objectives is.
  • Figure 5: Policy Entropy Insights for TRPO Pretraining in Env (i) and Env (ii). Lower Entropic Policies with Disjoint Objectives might justify the difference in pre-training performance even if the performances in training are similar.
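
The distributions described in the Figure 1 caption can be illustrated with a small numerical sketch. The code below is not the paper's implementation: it assumes two agents that share one discrete per-agent state space, and the function names (`entropy`, `empirical_distributions`) and the toy trajectories are hypothetical, chosen only for illustration. It builds the marginal, joint, and mixture empirical distributions from a single trial and compares their supports and entropies.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def empirical_distributions(traj_1, traj_2, n_states):
    """Empirical marginal, joint and mixture state distributions of two agents.

    Assumption (not from the paper): both trajectories have the same length and
    contain state indices in {0, ..., n_states - 1} of a shared per-agent space.
    """
    traj_1 = np.asarray(traj_1)
    traj_2 = np.asarray(traj_2)
    horizon = len(traj_1)

    # Marginal distributions: visitation frequencies of each agent on its own.
    d1 = np.bincount(traj_1, minlength=n_states) / horizon
    d2 = np.bincount(traj_2, minlength=n_states) / horizon

    # Joint distribution: frequencies of state pairs over the product space.
    joint = np.zeros((n_states, n_states))
    np.add.at(joint, (traj_1, traj_2), 1.0)
    joint /= horizon

    # Mixture distribution: average of the marginals over the common space.
    mixture = 0.5 * (d1 + d2)
    return d1, d2, joint, mixture


# Toy single trial (hypothetical data): agent 1 stays in states {0, 1},
# agent 2 stays in states {2, 3}.
traj_1 = [0, 0, 1, 1]
traj_2 = [2, 3, 2, 3]
d1, d2, joint, mixture = empirical_distributions(traj_1, traj_2, n_states=4)

print("nonzero joint entries:", np.count_nonzero(joint), "of", joint.size)
print("nonzero mixture entries:", np.count_nonzero(mixture), "of", mixture.size)
print("H(d1), H(d2):", entropy(d1), entropy(d2))
print("H(joint):", entropy(joint))
print("H(mixture):", entropy(mixture))
```

In this toy trial the joint distribution is supported on only a few cells of the product space, while the mixture covers the whole common space, which matches the caption's remark that the mixture distribution is usually less sparse.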

Theorems & Definitions (15)

  • Lemma 1: Entropy Mismatch
  • Theorem 4.1: Finite-Trials Mismatch in cMGs
  • Definition 5.1: Surrogate Function over a Single Trial
  • Lemma 1: Entropy Mismatch
  • proof
  • Theorem B.1: Finite-Trials Mismatch in cMGs
  • proof
  • Lemma B.4: (i) Global optimality of stationary policies [zhang2020variationalpolicygradientmethod]
  • Lemma B.5: (ii) Projection Operator [leonardos2021globalconvergencemultiagentpolicy]
  • Theorem B.6: (iii) Convergence rate of independent PGA to stationary points (Formal Fact \ref{fact:sufficiencypga})
  • ...and 5 more