STAS: Spatial-Temporal Return Decomposition for Multi-agent Reinforcement Learning

Sirui Chen; Zhaowei Zhang; Yaodong Yang; Yali Du

TL;DR

This work tackles credit assignment in cooperative multi-agent reinforcement learning under episodic delayed rewards by introducing STAS (Spatial-Temporal Attention with Shapley). STAS couples a temporal attention module, which captures sequence-level dynamics, with a spatial Shapley attention module, which distributes the global return among agents according to their contributions. The Shapley values are approximated via Monte Carlo sampling with masked self-attention. The paper formulates a Spatial-Temporal Return Decomposition problem and reports superior performance and stability on the Alice & Bob task and the multi-agent particle environments, outperforming strong baselines such as QMIX, COMA, and SQDDPG. The results indicate that combining explicit spatial credit distribution with temporal decomposition yields more accurate and efficient credit assignment, supporting robust, scalable learning in multi-agent settings with heavily delayed rewards.

Abstract

Centralized Training with Decentralized Execution (CTDE) has been proven to be an effective paradigm in cooperative multi-agent reinforcement learning (MARL). One of the major challenges is credit assignment, which aims to credit agents by their contributions. While prior studies have shown great success, their methods typically fail to work in episodic reinforcement learning scenarios where global rewards are revealed only at the end of the episode. They lack the functionality to model complicated relations of the delayed global reward in the temporal dimension and suffer from inefficiencies. To tackle this, we introduce Spatial-Temporal Attention with Shapley (STAS), a novel method that learns credit assignment in both temporal and spatial dimensions. It first decomposes the global return back to each time step, then utilizes the Shapley Value to redistribute the individual payoff from the decomposed global reward. To mitigate the computational complexity of the Shapley Value, we introduce an approximation of marginal contribution and utilize Monte Carlo sampling to estimate it. We evaluate our method on an Alice & Bob example and MPE environments across different scenarios. Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines.
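The Monte Carlo estimate of the Shapley Value mentioned in the abstract can be illustrated with a minimal sketch. It averages each agent's marginal contribution over randomly sampled agent orderings. The names `monte_carlo_shapley`, `value_fn`, and `toy_value` are hypothetical and do not come from the paper. In STAS the marginal contribution is itself approximated with masked self-attention, whereas here a hand-written coalition value function stands in for that learned component.

```python
import random


def monte_carlo_shapley(agents, value_fn, num_samples=1000, seed=0):
    """Estimate Shapley values by sampling random agent orderings.

    value_fn maps a frozenset of agents (a coalition) to a scalar payoff.
    It is a stand-in assumption, not the attention-based approximation
    used by STAS.
    """
    rng = random.Random(seed)
    totals = {agent: 0.0 for agent in agents}
    for _ in range(num_samples):
        order = list(agents)
        rng.shuffle(order)
        coalition = frozenset()
        previous_value = value_fn(coalition)
        for agent in order:
            # Marginal contribution of `agent` when it joins the coalition.
            coalition = coalition | {agent}
            current_value = value_fn(coalition)
            totals[agent] += current_value - previous_value
            previous_value = current_value
    return {agent: total / num_samples for agent, total in totals.items()}


def toy_value(coalition):
    # Hypothetical game: payoff 1.0 only if alice and bob are both present.
    return 1.0 if {"alice", "bob"} <= coalition else 0.0


estimates = monte_carlo_shapley(["alice", "bob", "carol"], toy_value, num_samples=2000)
```

In this toy game the exact Shapley values are 0.5 for `alice`, 0.5 for `bob`, and 0 for `carol`, so the estimates for the first two should land close to 0.5. Sampling orderings avoids enumerating all coalitions, whose number grows exponentially with the number of agents.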

Paper Structure (18 sections, 14 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Related Work
Preliminaries
Method
Spatial Decomposition: Shapley Value
Temporal Decomposition
Overall objective
Experiments
Evaluate Algorithms
Extreme delayed reward environment
Environment settings
Results
Multi-agent Particle Environment
Environment settings
Results
...and 3 more sections

Figures (5)

Figure 1: The STAS framework. It contains a temporal attention module and a spatial Shapley attention module. Global states and actions of the entire episode are encoded and fed into the temporal attention module, which uses a position embedding and a time causality mask. The spatial Shapley attention module then approximates the Shapley value from the previously learned representations in the spatial dimension. Finally, the model is updated based on the Shapley value approximation.
Figure 2: A simple demonstration of the designed extreme delayed reward environment Alice & Bob. When a key is reached, the corresponding door will open. To obtain the treasure, Alice must first retrieve the brown key to unlock the door to Bob's room. After Alice has unlocked the door, Bob can then retrieve the green key to open the door to Alice's room. With both doors unlocked, they can proceed to the treasure together.
Figure 3: Average agent rewards and treasure-reaching rate, with standard deviation, for the Alice & Bob task.
Figure 4: Average agent rewards, with standard deviation, for the Cooperative Navigation and Predator-prey scenarios in the Multi-agent Particle Environments.
Figure 5: The average reward of STAS (a) and STAS-ML (b) in the Predator-prey task (3 agents).
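
The time causality mask named in the Figure 1 caption restricts each time step to attend only to itself and earlier steps. The following plain-Python sketch shows the idea on a small matrix of attention scores. The function names and the example scores are hypothetical, and the paper's actual attention layers are not reproduced here.

```python
import math


def causal_mask(num_steps):
    """mask[i][j] is True when step i may attend to step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, keep in zip(score_row, mask_row) if keep]
        row_max = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) if keep else 0.0
                for s, keep in zip(score_row, mask_row)]
        normalizer = sum(exps)
        weights.append([e / normalizer for e in exps])
    return weights


# Hypothetical raw attention scores for a 3-step episode.
scores = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = masked_softmax(scores, causal_mask(3))
```

With this mask, the first step attends only to itself, and the large score at the future position in the second row receives zero weight.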
