
Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning

Junkai Zhang, Yifan Zhang, Xi Sheryl Zhang, Yifan Zang, Jian Cheng

TL;DR

This paper addresses low sample efficiency in CTDE-based cooperative MARL by identifying divergent action tendencies among agents as a key bottleneck. It introduces Intrinsic Action Tendency Consistency (IAM), which uses an action model to generate intrinsic rewards from how well a central agent's actions match its surrounding agents' predictions, and integrates these rewards via Reward-Additive CTDE (RA-CTDE) so that CTDE's gradient updates are preserved. The authors prove that RA-CTDE is equivalent to CTDE in its gradient updates and demonstrate strong empirical gains on SMAC and GRF when combining IAM with QMIX or VDN, including robustness in sparse-reward settings. The work offers a practical way to improve credit assignment and coordination in cooperative MARL, with implications for scalable, sample-efficient multi-agent training.

Abstract

Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of adequate team consensus-related guidance signals during credit assignments in CTDE. To address this, we propose Intrinsic Action Tendency Consistency, a novel approach for cooperative multi-agent reinforcement learning. It integrates intrinsic rewards, obtained through an action model, into a reward-additive CTDE (RA-CTDE) framework. We formulate an action model that enables surrounding agents to predict the central agent's action tendency. Leveraging these predictions, we compute a cooperative intrinsic reward that encourages agents to match their actions with their neighbors' predictions. We establish the equivalence between RA-CTDE and CTDE through theoretical analyses, demonstrating that CTDE's training process can be achieved using agents' individual targets. Building on this insight, we introduce a novel method to combine intrinsic rewards and CTDE. Extensive experiments on challenging tasks in SMAC and GRF benchmarks showcase the improved performance of our method.
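
As a concrete illustration of the action-model intrinsic reward described above, the sketch below computes, for one central agent, a reward that is high when the agent's action tendency (here taken as the softmax of its Q-values) matches its surrounding agents' predictions. All names (ActionModel, intrinsic_reward, the network architecture) and the use of negative cross-entropy as the agreement score are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionModel(nn.Module):
    """Hypothetical action model: a surrounding agent predicts the central
    agent's action tendency (a distribution over actions) from its own
    local observation. Architecture and inputs are illustrative only."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, neighbor_obs: torch.Tensor) -> torch.Tensor:
        # Predicted action-tendency logits for the central agent.
        return self.net(neighbor_obs)

def intrinsic_reward(action_model: ActionModel,
                     neighbor_obs: torch.Tensor,      # (num_neighbors, obs_dim)
                     central_q_values: torch.Tensor   # (n_actions,)
                     ) -> torch.Tensor:
    """Reward is high when neighbors' predictions agree with the central
    agent's actual action tendency. Negative cross-entropy as the
    agreement measure is an assumption, not the paper's exact choice."""
    tendency = F.softmax(central_q_values, dim=-1)        # central agent's tendency
    pred_log_probs = F.log_softmax(action_model(neighbor_obs), dim=-1)
    # Average agreement over the surrounding agents; higher = better match.
    cross_entropy = -(tendency.unsqueeze(0) * pred_log_probs).sum(dim=-1)
    return -cross_entropy.mean()
```

In practice such a reward would be computed per agent and per timestep and scaled before being added to the learning target; the paper's exact model inputs and matching measure may differ.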

Paper Structure

This paper contains 43 sections, 4 theorems, 25 equations, 12 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\{\theta_i\}_{i=1}^{N}$ be the parameters of the individual $Q$ functions, $\phi$ the parameters of the mixing network $\mathcal{F}$ in CTDE, $\mathcal{L}^G$ the global target in Eq. (CTDE_LOSS), $\mathcal{N}=\{1,\dots,N\}$ the set of agents, and $\mathcal{Q}^{N}=\{Q_i(\tau_i, a_i; \theta_i)\}_{i=1}^{N}$ the set of individual $Q$ functions ...
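
For orientation, here is a minimal sketch of the global target that $\mathcal{L}^G$ and Eq. (CTDE_LOSS) refer to, written in the standard value-decomposition TD form; the exact equation is not reproduced in this summary, so the notation below is an assumption chosen to be consistent with the symbols in the Figure 3 caption:

$$\mathcal{L}^G\big(\{\theta_i\}_{i=1}^{N}, \phi\big) = \mathbb{E}\Big[\big(r^{ext} + \gamma\, Q_T - Q_{tot}\big)^2\Big], \qquad Q_{tot} = \mathcal{F}\big(Q_1(\tau_1, a_1; \theta_1), \dots, Q_N(\tau_N, a_N; \theta_N); \phi\big),$$

where $Q_T$ denotes the target-network bootstrap estimate for the next joint observation-action and $r^{ext}$ the external team reward.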

Figures (12)

  • Figure 1: Illustration of consistent action tendency. In (a) and (b), our agents' health is lower than the enemies'. At this point, the two best team policies are for both agents to attack Enemy 1 together or Enemy 2 together. In (a), Agent 1 and Agent 2 attack enemies separately without agreeing on a team policy. By contrast, the agents in (b) reach a consistent team policy and agree to attack a common enemy. To capture this policy consistency among agents, we propose the concept of action tendency, which reflects an agent's policy distribution over different actions. We introduce this notion to distinguish it from policy, which in value-based approaches is usually the epsilon-greedy rule over the $Q$ function and depends only on its largest output.
  • Figure 2: IAM-based reward. The blue and green zones represent the receptive field of the central agent and surrounding agents. The action model intrinsic reward is high when agent $i$ takes actions that match their surrounding agents' predictions.
  • Figure 3: IAM training paradigm. The training paradigm consists of two stages: (a) the forward stage and (b) the backward stage. In the forward stage, we use the mixing network $\mathcal{F}$, $Q_{tot}$, $r^{ext}$, and $Q_T$ in Eq. (CTDE_LOSS) to compute the global TD-error target $\mathcal{L}^G$, exactly as in CTDE. In the backward stage, we first factorize $\mathcal{L}^G$ into $N$ targets $\{\mathcal{L}_i^E\}_{i=1}^{N}$ using Definitions 1 and 2, then add intrinsic rewards to each of them individually to obtain the IAM targets $\{\mathcal{L}_i^{IAM}\}_{i=1}^{N}$. The gradients of $\{\theta_i\}_{i=1}^{N}$ and $\phi$ are computed separately by backpropagating the $N$ targets $\{\mathcal{L}_i^{IAM}\}_{i=1}^{N}$ (see the sketch after this figure list).
  • Figure 4: Performance comparisons for various maps in SMAC.
  • Figure 5: A visualization example of IAM on 8m_vs_9m. In this task, agents need to learn a three-stage team policy to win. In stage 1, agents need to spread out to maximally distract enemy attacks. In stage 2, agents need to concentrate firepower on the same enemy to reduce the enemy's numbers. In stage 3, agents need to retreat quickly when their health is low to avoid being attacked and increase survival time. Stage 2 is the hardest to learn because the agents must cooperate toward the same policy target, i.e., action tendency consistency. (a), (b), and (c) show the three team-policy stages of QMIX-IAM, while (d) shows QMIX's distributed fire against the enemy.
  • ...and 7 more figures
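
The following sketch illustrates the two-stage procedure summarized in the Figure 3 caption: compute the global TD target as in CTDE, form one target per agent with that agent's intrinsic reward added, and backpropagate the per-agent losses. The function name, the way the global target is reused per agent (standing in for the factorization given by Definitions 1 and 2), and the intrinsic-reward scale alpha are illustrative assumptions, not the paper's exact RA-CTDE equations.

```python
import torch

def ra_ctde_update(mixer, agent_qs, q_target_tot, r_ext, r_int, gamma, optimizer, alpha=0.1):
    """Hypothetical sketch of an RA-CTDE-style update (not the paper's exact equations).

    agent_qs:     list of N chosen-action values Q_i(tau_i, a_i; theta_i), each (batch,)
    q_target_tot: target-network estimate Q_T for the next step, (batch,)
    r_ext:        external team reward, (batch,)
    r_int:        list of N per-agent intrinsic rewards from the action model, each (batch,)
    alpha:        assumed scale on the intrinsic reward.
    """
    # Forward stage (same as CTDE): mix individual utilities into Q_tot.
    q_tot = mixer(torch.stack(agent_qs, dim=-1))            # Q_tot = F(Q_1, ..., Q_N; phi)

    # Backward stage: one target per agent. Here each agent reuses the global
    # TD target with its own intrinsic reward added; the paper's factorization
    # via Definitions 1 and 2 may distribute the target differently.
    losses = []
    for i in range(len(agent_qs)):
        y_i = (r_ext + alpha * r_int[i] + gamma * q_target_tot).detach()
        losses.append(((y_i - q_tot) ** 2).mean())           # stands in for L_i^{IAM}

    # Backpropagating the N per-agent targets updates both {theta_i} and phi.
    optimizer.zero_grad()
    torch.stack(losses).sum().backward()
    optimizer.step()
```

Summing the N losses here is a simplification; per the Figure 3 caption, the gradients of $\{\theta_i\}_{i=1}^{N}$ and $\phi$ are computed separately from the N targets.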

Theorems & Definitions (8)

  • Definition 1
  • Theorem 1
  • Corollary 1
  • Definition 2
  • Theorem 2
  • Proof
  • Corollary 2
  • Proof