GRASP: Gradient Realignment via Active Shared Perception for Multi-Agent Collaborative Optimization

Sihan Zhou, Tiantian He, Yifan Lu, Yaqing Hou, Yew-Soon Ong

Abstract

In multi-agent systems, non-stationarity arises from concurrent policy updates and leads to persistent environmental fluctuations. Existing approaches such as Centralized Training with Decentralized Execution (CTDE) and sequential update schemes mitigate this issue. However, because each agent's perception of the other agents' policies still depends on sampled environmental interaction data, agents essentially operate in a passive perception state. This inevitably triggers equilibrium oscillations and significantly slows the system's convergence. To address this issue, we propose Gradient Realignment via Active Shared Perception (GRASP), a novel framework that defines a generalized Bellman equilibrium as a stable objective for policy evolution. The core mechanism of GRASP derives a consensus gradient from the agents' independent gradients, enabling agents to actively perceive one another's policy updates and optimize team collaboration. Theoretically, we leverage the Kakutani Fixed-Point Theorem to prove that the consensus direction $u^*$ guarantees the existence and attainability of this equilibrium. Extensive experiments on the StarCraft II Multi-Agent Challenge (SMAC) and Google Research Football (GRF) demonstrate the scalability and promising performance of the framework.

Paper Structure

This paper contains 34 sections, 7 theorems, 44 equations, 5 figures, 10 tables, and 2 algorithms.

Key Result

Theorem 3.2

When a $u^*$ satisfying the definition can be obtained, the multi-agent system (MAS) can solve for $\pi_i^*$ such that $\lim_{t \to \infty} \pi_i^{(t)} = \pi_i^*$, where

$$\pi_i^{(t)} = \pi_i^{(t-1)} + u^{(t)} + g_i^{(t)}, \quad t = 1, 2, \dots$$

We call the mapping $f(\cdot): \{g_1, \dots, g_N\} \to u^*$ the consensus operator.
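To make the iteration concrete, the following is a minimal Python sketch of the Theorem 3.2 update, treating each policy as a parameter vector. Gradient averaging is used here only as a hypothetical stand-in for the consensus operator $f$; the paper itself derives $u^*$ from a QP (see Proposition 3.7 below).

```python
import numpy as np

def grasp_update(policies, grads, consensus_op):
    """One iteration of pi_i^(t) = pi_i^(t-1) + u^(t) + g_i^(t) (Theorem 3.2).

    policies, grads: lists of per-agent parameter / gradient vectors.
    consensus_op:    the consensus operator f({g_1, ..., g_N}) -> u*.
    """
    u = consensus_op(grads)                      # shared direction u^(t)
    return [pi + u + g for pi, g in zip(policies, grads)]

# Placeholder consensus operator: plain averaging (illustrative only; GRASP
# derives u* from a QP rather than a mean).
average_consensus = lambda grads: np.mean(np.stack(grads), axis=0)

policies = [np.zeros(3), np.zeros(3)]
grads = [np.array([1.0, 0.0, 0.5]), np.array([0.5, 0.8, 0.0])]
policies = grasp_update(policies, grads, average_consensus)
```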

Figures (5)

  • Figure 1: Comparison of win rate learning curves between GRASP and baseline algorithms (MAPPO, HAPPO, and $\text{MA}^2\text{E}$) across six scenarios in SMAC and SMACv2. Solid lines represent the average win rate from multiple runs, while shaded areas indicate standard deviation.
  • Figure 2: Comparison of win rate learning curves between GRASP and baseline algorithms (MAPPO and $\text{MA}^2\text{E}$) across three scenarios in GRF. Solid lines represent the average win rate from multiple runs, while shaded areas indicate standard deviation.
  • Figure 3: Comparison of win rate learning curves between GRASP and MAPPO in the Simple Spread scenario of MPE. Solid lines represent the average win rate from multiple runs, while shaded areas indicate standard deviation.
  • Figure 4: Comparison of win rate learning curves between GRASP-QMIX and QMIX across three scenarios in SMAC. Solid lines represent the average win rate from multiple runs, while shaded areas indicate standard deviation.
  • Figure 5: Comparison of win rate learning curves between GRASP and MAPPO in the 27m_vs_30m scenario of SMAC. Solid lines represent the average win rate from multiple runs, while shaded areas indicate standard deviation.

Theorems & Definitions (14)

  • Definition 3.1: Optimal Collaboration Policy
  • Theorem 3.2: Gradient Realignment via Active Shared Perception Framework
  • Proposition 3.3: Consensus Equilibrium as Generalized Bellman Equilibrium
  • Theorem 3.5: GRASP Equilibrium Point Existence
  • Proof
  • Proposition 3.6: Consensus-Driven Update Mechanism
  • Proof
  • Proposition 3.7: QP Consensus Operator (see the illustrative sketch after this list)
  • Proof
  • Theorem 5.1
  • ...and 4 more
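Proposition 3.7 characterizes the consensus operator as a quadratic program (QP). The paper's exact constraints are not reproduced here; as an assumed illustration, the sketch below computes the minimum-norm convex combination of the agents' gradients (an MGDA-style QP), which is one standard way such an operator is posed.

```python
import numpy as np
from scipy.optimize import minimize

def consensus_direction(grads):
    """Illustrative QP consensus operator: the minimum-norm point in the
    convex hull of the per-agent gradients g_1..g_N (MGDA-style).
    This is an assumed formulation, not necessarily the paper's exact QP."""
    G = np.stack(grads)                 # (N, d): one gradient per row
    N = G.shape[0]
    GG = G @ G.T                        # Gram matrix of pairwise inner products

    # min_w  w^T (G G^T) w   s.t.  w >= 0,  sum(w) = 1
    objective = lambda w: float(w @ GG @ w)
    w0 = np.full(N, 1.0 / N)            # start from the uniform combination
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * N,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x @ G                    # u* = sum_i w_i g_i

# Toy usage: two agents with partially conflicting gradients.
g1, g2 = np.array([1.0, 0.0]), np.array([0.5, 0.8])
u_star = consensus_direction([g1, g2])
```

A minimum-norm $u^*$ of this form is a common choice because it cannot conflict strongly with any individual agent's gradient $g_i$, which is consistent with the role of the shared consensus direction in the Theorem 3.2 update above.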