ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games

Ruochuan Shi, Runyu Lu, Yuanheng Zhu, Dongbin Zhao

TL;DR

The paper tackles sparse rewards in graph-structured multi-agent adversarial tasks with ARAC, which combines an attention-based graph neural encoder–decoder with adaptive divergence regularization in a Soft Actor-Critic framework. A KL term toward a reference policy is scheduled adaptively, so that training exploits the guidance early and relies on it less later, avoiding premature convergence to the reference policy's suboptimal behavior. Two theorems, on policy evaluation and policy improvement, support convergence. Experiments in pursuit and confrontation settings show faster convergence, higher final success rates, and scalability across agent counts, along with cross-map generalization and self-play benefits. The work illustrates the value of pairing graph-aware representations with adaptive regularization for coordination and decision-making in complex multi-agent environments.

Abstract

In graph-structured multi-agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose Adaptive Regularized Multi-Agent Soft Actor-Critic (ARAC), which integrates an attention-based graph neural network (GNN) for modeling agent dependencies with an adaptive divergence regularization mechanism. The GNN enables expressive representation of spatial relations and state features in graph environments. Divergence regularization can serve as policy guidance to alleviate the sparse reward problem, but it may lead to suboptimal convergence when the reference policy itself is imperfect. The adaptive divergence regularization mechanism enables the framework to exploit reference policies for efficient exploration in the early stages, while gradually reducing reliance on them as training progresses to avoid inheriting their limitations. Experiments in pursuit and confrontation scenarios demonstrate that ARAC achieves faster convergence, higher final success rates, and stronger scalability across varying numbers of agents compared with MARL baselines, highlighting its effectiveness in complex graph-structured environments.
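The trade-off described in the abstract, following a reference policy early and relying on it less later, can be illustrated with a small single-state sketch. It is not the paper's algorithm. It assumes a discrete action set, an entropy weight `alpha`, a KL weight `lam`, and a hypothetical linear decay for `lam`. The paper's actual adaptive rule is not reproduced here, and every function name is illustrative. For the objective $\mathbb{E}_\pi[Q] + \alpha H(\pi) - \lambda\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$, the maximizer has the closed form $\pi(a) \propto \exp\big((Q(a) + \lambda \log \pi_{\text{ref}}(a)) / (\alpha + \lambda)\big)$, which the code evaluates at three points of the schedule.

```python
import math


def regularized_policy(q_values, ref_probs, alpha, lam):
    """Closed-form maximizer of E_pi[Q] + alpha*H(pi) - lam*KL(pi || pi_ref).

    Assumes a discrete action set, alpha > 0, lam >= 0, and strictly
    positive reference probabilities (so that log(pi_ref) is finite).
    """
    scale = alpha + lam
    logits = [(q + lam * math.log(p)) / scale for q, p in zip(q_values, ref_probs)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def linear_decay(step, total_steps, lam_start=1.0, lam_end=0.0):
    """Hypothetical schedule: the KL weight shrinks linearly with training step."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)


# Illustrative numbers: the reference policy prefers action 1,
# while the learned Q-values prefer action 0.
q_values = [1.0, 0.0, 0.5]
ref_probs = [0.1, 0.8, 0.1]
alpha = 0.2

for step in (0, 500, 1000):
    lam = linear_decay(step, 1000)
    probs = regularized_policy(q_values, ref_probs, alpha, lam)
    print(step, round(lam, 2), [round(p, 3) for p in probs])
```

With a large `lam` the policy stays close to the reference. As `lam` approaches zero the expression reduces to the usual soft-max over `Q / alpha`, so an imperfect reference no longer constrains the result.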

Paper Structure

This paper contains 34 sections, 2 theorems, 21 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

theorem 1

Assume the reward $r(s,\mathbf{a})$ is bounded and the regularization term $\Omega_s(\pi)$ for a fixed policy $\pi$ is bounded. Then the operator $\mathcal{T}^\pi$ is a $\gamma$-contraction in the $\|\cdot\|_\infty$ norm. The iteration $Q \leftarrow \mathcal{T}^\pi Q$ converges uniformly to the uniq
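The contraction property stated in Theorem 1 can be checked numerically on a toy problem. The sketch below assumes a generic regularized policy-evaluation operator, $(\mathcal{T}^\pi Q)(s,\mathbf{a}) = r(s,\mathbf{a}) + \gamma\, \mathbb{E}_{s'}\big[\mathbb{E}_{\mathbf{a}' \sim \pi} Q(s',\mathbf{a}') - \Omega_{s'}(\pi)\big]$. This form is an assumption and may differ from the paper's exact definition. The tabular MDP, the policy, and the regularizer values are all hypothetical, and the joint action is treated as a single index.

```python
import random

GAMMA = 0.9                 # discount factor (illustrative value)
N_STATES, N_ACTIONS = 3, 2  # hypothetical toy problem size
rng = random.Random(0)      # fixed seed for reproducibility


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy MDP: transitions P[s][a][s'] and bounded rewards R[s][a].
P = [[normalize([rng.random() + 0.1 for _ in range(N_STATES)])
      for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
R = [[rng.uniform(-1.0, 1.0) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
# Fixed policy PI[s][a] and a bounded regularizer value OMEGA[s] for that policy.
PI = [normalize([rng.random() + 0.1 for _ in range(N_ACTIONS)]) for _ in range(N_STATES)]
OMEGA = [rng.uniform(0.0, 1.0) for _ in range(N_STATES)]


def bellman(Q):
    """Apply the assumed regularized policy-evaluation operator once."""
    v = [sum(PI[s][a] * Q[s][a] for a in range(N_ACTIONS)) - OMEGA[s]
         for s in range(N_STATES)]
    return [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_STATES))
             for a in range(N_ACTIONS)] for s in range(N_STATES)]


def sup_dist(Q1, Q2):
    """Sup-norm distance between two Q-tables."""
    return max(abs(x - y) for row1, row2 in zip(Q1, Q2) for x, y in zip(row1, row2))


def evaluate(Q, iters=500):
    """Iterate Q <- T Q a fixed number of times."""
    for _ in range(iters):
        Q = bellman(Q)
    return Q


def random_q():
    return [[rng.uniform(-10.0, 10.0) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


Q_a, Q_b = random_q(), random_q()
ratio = sup_dist(bellman(Q_a), bellman(Q_b)) / sup_dist(Q_a, Q_b)
Q_star_a, Q_star_b = evaluate(Q_a), evaluate(Q_b)
print("one-step distance ratio:", ratio, "(gamma =", GAMMA, ")")
print("gap between the two limits:", sup_dist(Q_star_a, Q_star_b))
```

Because the regularizer term does not depend on $Q$, it cancels in the difference $\mathcal{T}^\pi Q_1 - \mathcal{T}^\pi Q_2$. That cancellation is why the one-step ratio stays at or below $\gamma$ and why both starting points reach the same limit.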

Figures (10)

  • Figure 1: Illustration of the pursuit and confrontation scenarios.
  • Figure 2: Structure of the feature representation graph encoding method.
  • Figure 3: Overview of the ARAC training framework.
  • Figure 4: Success rate curves of our method and baseline algorithms in the pursuit and confrontation scenarios. Shaded regions denote the standard deviation over 3 runs.
  • Figure 5: Success rate curves of our method and other feature representation approaches in the pursuit and confrontation scenarios. Shaded regions denote the standard deviation over 3 runs.
  • ...and 5 more figures

Theorems & Definitions (2)

  • theorem 1: Policy Evaluation
  • theorem 2: Policy Improvement