Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization

Seongmin Kim, Giseung Park, Woojun Kim, Jiwon Jeon, Seungyeol Han, Youngchul Sung

TL;DR

A novel framework for multi-agent reinforcement learning that improves sample efficiency and coordination through accurate per-agent advantage estimation. Its core is the Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages.

Abstract

In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is the Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by estimating values indirectly via action probabilities, eliminating the need for direct Q-function estimation. To further refine the estimate, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent's own policy changes against robustness to non-stationarity from other agents. Experiments on the SMAX and MABrax benchmarks demonstrate that our approach outperforms existing methods, excelling in coordination and sample efficiency in complex scenarios.

Paper Structure

This paper contains 20 sections, 3 theorems, 21 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

$\mathcal{R}_{\text{on}}^i$ is a $\gamma$-contraction, which means that $\overline{EQ}^i$ converges to the unique fixed point. If $\lambda = 1$, the fixed point is $\mathbb{E}_{a^i \sim \pi^i}[Q^{\boldsymbol{\pi}}(s, a^i, \boldsymbol{a}^{-i})]$.
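
To illustrate what the contraction property buys, here is a minimal sketch in plain Python. It does not implement $\mathcal{R}_{\text{on}}^i$; as a stand-in it uses the standard Bellman expectation operator on a hypothetical 3-state Markov reward process (the rewards, transition matrix, and all function names are assumptions made for illustration). That operator is also a $\gamma$-contraction in the sup norm, so two different starting points are pulled together by a factor of at most $\gamma$ per application and end at the same fixed point.

```python
GAMMA = 0.9

# Hypothetical 3-state Markov reward process (illustrative, not from the paper).
REWARDS = [1.0, 0.0, 2.0]
TRANSITIONS = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]


def bellman_operator(values):
    """Apply T(V)(s) = r(s) + gamma * sum_s' P(s, s') * V(s')."""
    return [
        REWARDS[s] + GAMMA * sum(p * v for p, v in zip(TRANSITIONS[s], values))
        for s in range(len(values))
    ]


def sup_distance(u, v):
    """Sup-norm distance between two value vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def iterate(values, steps):
    """Apply the operator `steps` times starting from `values`."""
    for _ in range(steps):
        values = bellman_operator(values)
    return values


def contraction_ratios(u, v, steps):
    """Ratio of successive sup-norm gaps between two iterate sequences."""
    ratios = []
    for _ in range(steps):
        gap_before = sup_distance(u, v)
        u, v = bellman_operator(u), bellman_operator(v)
        ratios.append(sup_distance(u, v) / gap_before)
    return ratios


if __name__ == "__main__":
    start_a = [0.0, 0.0, 0.0]
    start_b = [10.0, -5.0, 3.0]
    # Every ratio should be at most GAMMA.
    print(contraction_ratios(start_a, start_b, 5))
    # Both starting points should end at (numerically) the same vector.
    print(iterate(start_a, 500))
    print(iterate(start_b, 500))
```

The same reasoning is what makes the choice of initial estimate irrelevant in the theorem: after $k$ applications the distance to the fixed point is bounded by $\gamma^k$ times the initial distance.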

Figures (4)

  • Figure 1: (a) Advantage gap $\Delta A$ indicating how strongly each method penalizes the anomalous agent. Higher $\Delta A$ means more effective credit assignment against the “stop” action. (b) Average win rates, illustrating overall learning stability and performance.
  • Figure 2: Off-policy correction comparisons in the SMAX-1s1z task. (a) Distance from true $\rho^i$, (b) Distance from joint true $\boldsymbol{\rho}$, (c) Gap between (a) and (b), (d) Final performance. Legend is shared across all plots. DT-ISR demonstrates the lowest gap $\Delta c^i$ and the highest performance.
  • Figure 3: Illustration of the two evaluation environments.
  • Figure 4: Learning curves of performance aggregated over all tasks on (a) SMAX and (b) MABrax.

Theorems & Definitions (5)

  • Definition 4.1
  • Theorem 4.1
  • Theorem 4.2
  • Definition 4.2
  • Theorem 4.3