AIR: Unifying Individual and Collective Exploration in Cooperative Multi-Agent Reinforcement Learning

Guangchong Zhou, Zeren Zhang, Guoliang Fan

TL;DR

This work tackles the exploration challenge in cooperative value-based MARL under CTDE by introducing AIR, a unified framework that blends individual and collective exploration through an identity classifier and adaptive temperature. AIR leverages trajectory-based diversity via KL divergence and mutual information, and employs an adversarial mechanism to promote exploration of low-probability actions without compromising value estimation. The approach is underpinned by a theoretical connection between classifier accuracy and exploration, and demonstrates strong empirical performance on SMAC and Google Research Football, including ablations that confirm the necessity of integrating both exploration modes and dynamic temperature. Overall, AIR offers a lightweight, effective pathway to scalable exploration in multi-agent coordination tasks with potential for broad applicability in CTDE settings.

Abstract

Exploration in cooperative multi-agent reinforcement learning (MARL) remains challenging for value-based agents due to the absence of an explicit policy. Existing approaches include individual exploration based on uncertainty towards the system and collective exploration through behavioral diversity among agents. However, the introduction of additional structures often leads to reduced training efficiency and infeasible integration of these methods. In this paper, we propose Adaptive exploration via Identity Recognition (AIR), which consists of two adversarial components: a classifier that recognizes agent identities from their trajectories, and an action selector that adaptively adjusts the mode and degree of exploration. We theoretically prove that AIR can facilitate both individual and collective exploration during training, and experiments also demonstrate the efficiency and effectiveness of AIR across various tasks.
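
To make the idea of a temperature-controlled "degree of exploration" concrete, the following is a minimal, generic sketch of temperature-scaled softmax (Boltzmann) action selection over an agent's Q-values. It is not taken from the paper and should not be read as AIR's action selector: the function names `boltzmann_probs` and `sample_action`, the example Q-values, and the fixed temperatures are all illustrative assumptions. AIR adapts its temperature during training, which this sketch does not attempt to model.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Return softmax probabilities over Q-values at a given temperature.

    Hypothetical helper for illustration; not the paper's selector.
    A higher temperature flattens the distribution (more exploration);
    a lower temperature concentrates it on the greedy action.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, temperature, rng=random):
    """Sample an action index from the temperature-scaled softmax."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]  # made-up Q-values for three actions
    print(boltzmann_probs(q, temperature=0.1))    # close to greedy
    print(boltzmann_probs(q, temperature=100.0))  # close to uniform
    print(sample_action(q, temperature=1.0))
```

With a small temperature, low-probability actions are rarely chosen; raising the temperature shifts probability mass toward them without changing the Q-values themselves.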

Paper Structure

This paper contains 29 sections, 3 theorems, 30 equations, 8 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Given the system trajectory visit distribution $\rho$, of which the entropy $\mathcal{H}(\rho)$ can be decomposed as below:

Figures (8)

  • Figure 1: The importance of behavioral diversity in Google Research Football. (a) Agents all compete for the ball, exhibiting homogeneous behaviors and poor coordination. (b) Agents behave differently to achieve coordination.
  • Figure 2: Experiment results of AIR and baselines on SMAC.
  • Figure 3: Experiment results of AIR and baselines on GRF.
  • Figure 4: The training session on SMAC corridor. Top: Curves showing how the win rate and temperature value change during training. Bottom: 2D t-SNE visualizations of the agents' trajectories at different training steps.
  • Figure 5: The relative model sizes of algorithms.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Definition 1: Trajectory visit distribution
  • Definition 2: Difference between individual policies
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof