Redistributing Rewards Across Time and Agents for Multi-Agent Reinforcement Learning

Aditya Kapoor, Kale-ab Tessera, Mayank Baranwal, Harshad Khadilkar, Jan Peters, Stefano Albrecht, Mingfei Sun

TL;DR

Credit assignment in episodic multi-agent reinforcement learning is challenging because temporal and per-agent contributions are coupled. The paper introduces TAR², which learns unnormalized contribution scores and applies a deterministic normalization step to enforce return equivalence. The authors prove that the resulting scheme is equivalent to Potential-Based Reward Shaping (PBRS), so the optimal policy is preserved. Decoupling credit modeling from constraint satisfaction reduces gradient variance and improves learning stability. Empirical results on SMACLite and Google Research Football show that TAR² accelerates learning, achieves higher final performance than strong baselines, and rivals or outperforms hand-crafted oracle reward baselines.

Abstract

Credit assignment, disentangling each agent's contribution to a shared reward, is a critical challenge in cooperative multi-agent reinforcement learning (MARL). To be effective, credit assignment methods must preserve the environment's optimal policy. Some recent approaches attempt this by enforcing return equivalence, where the sum of distributed rewards must equal the team reward. However, their guarantees are conditional on a learned model's regression accuracy, making them unreliable in practice. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), an approach that decouples credit modeling from this constraint. A neural network learns unnormalized contribution scores, while a separate, deterministic normalization step enforces return equivalence by construction. We demonstrate that this method is equivalent to a valid Potential-Based Reward Shaping (PBRS), which guarantees the optimal policy is preserved regardless of model accuracy. Empirically, on challenging SMACLite and Google Research Football (GRF) benchmarks, TAR$^2$ accelerates learning and achieves higher final performance than strong baselines. These results establish our method as an effective solution for the agent-temporal credit assignment problem.
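
To make "return equivalence by construction" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the normalization is a single softmax over all (timestep, agent) pairs; the paper's exact normalization may differ, and the function name `redistribute` is hypothetical. Whatever scores the network produces, the redistributed rewards sum to the episodic team return.

```python
import numpy as np

def redistribute(scores, episode_return):
    """Turn unnormalized scores of shape (T, N) into per-agent, per-step rewards.

    Assumption: one softmax over all (t, i) pairs. Because the weights sum
    to 1, the rewards sum to `episode_return` regardless of score quality.
    """
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max()      # subtract max for numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum()             # deterministic normalization step
    return weights * episode_return      # dense rewards, shape (T, N)
```

The constraint is satisfied by arithmetic rather than by training the model to regress the return, which is the decoupling the abstract describes.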

Paper Structure

This paper contains 40 sections, 2 theorems, 12 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

Proposition 3.1

Let $\mathcal{M}_{\text{env}}$ be a Dec-POMDP where agent $i$ receives reward $r_{i,t}^{\text{orig}}$. Let $\mathcal{M}_{\text{TAR}^2}$ be an identical environment where agent $i$ receives the augmented reward $r'_{i,t} = r_{i,t}^{\text{orig}} + s_{i,t}$. Any joint policy $\pi^*$ optimal in $\mathca

Figures (4)

  • Figure 1: The TAR² architecture processes trajectory data through four main stages. (1) Input sequences are converted into embeddings with positional encoding. (2) A multi-layer transformer block with sequential Temporal and Agent Attention builds context-aware representations, regularized by an auxiliary Inverse Dynamics task to ensure causality. (3) The Score Network computes unnormalized scores by conditioning each timestep's representation on a learned Final Outcome Embedding (Z). (4) A final Probabilistic Normalization step converts these scores and the global reward $R(s_T)$ into dense, per-agent rewards $\{r_{i,t}\}$ that satisfy strict return equivalence.
  • Figure 2: TAR²'s Average Return comparison against baselines on Google Research Football (A-C) and SMACLite (D-F). On SMACLite, it demonstrates improved sample efficiency compared to STAS and converges to a higher average return than the unstable AREL variants. This trend is more pronounced on GRF, where TAR² consistently achieves the highest average return, particularly in 'Counter Attack Easy' and 'Pass and Shoot' scenarios.
  • Figure 3: Performance of TAR² relative to oracle rewards on SMACLite (D-F) and Google Research Football (A-C). TAR² enables learning a policy that is competitive with hand-crafted reward functions. TAR²'s performance rivals 'Temporal-Agent' in SMACLite and 'Temporal' in GRF. It outperforms all other heuristics, demonstrating that a learned credit assignment can be more effective than a manually engineered one.
  • Figure 4: Ablation study of TAR²'s core components. Removing any component degrades performance. 'No-Final-Outcome' increases variance, 'No-Inverse-Dynamics' hinders performance, and 'No-Normalization' is the most detrimental as it violates the policy preservation guarantee.
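
The "sequential Temporal and Agent Attention" in the Figure 1 caption can be illustrated with a small NumPy sketch. This is an assumption-laden simplification rather than the paper's architecture: it uses single-head attention with identity query/key/value projections, no masking, and plain residual connections, and the names `self_attention` and `temporal_agent_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Dot-product self-attention over the second-to-last axis of x (..., L, D).

    Assumption: identity projections instead of learned Q/K/V matrices.
    """
    d = x.shape[-1]
    logits = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., L, L)
    return softmax(logits, axis=-1) @ x                # (..., L, D)

def temporal_agent_block(h):
    """Apply temporal attention, then agent attention, to h of shape (T, N, D)."""
    # Temporal attention: each agent attends over its own timesteps.
    per_agent = np.swapaxes(h, 0, 1)                   # (N, T, D)
    h = h + np.swapaxes(self_attention(per_agent), 0, 1)
    # Agent attention: at each timestep, agents attend to one another.
    h = h + self_attention(h)                          # attention over N
    return h
```

Factoring attention along the two axes keeps each attention matrix at size T×T or N×N instead of (T·N)×(T·N), and the block is equivariant to reordering the agents.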

Theorems & Definitions (4)

  • Proposition 3.1: Optimal Policy Preservation
  • proof
  • Proposition 3.2: Stochastic Gradient Direction Preservation
  • proof
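
For context on Proposition 3.1, the standard single-agent form of Potential-Based Reward Shaping is recalled below. The potential $\Phi$ and discount $\gamma$ are the usual PBRS symbols and are assumptions here, since this excerpt does not show how the paper maps $s_{i,t}$ onto a potential. The shaping term telescopes along a trajectory, so it shifts the return by a quantity that depends only on the first and last states, which is why it cannot change the ranking of policies.

$$F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t), \qquad \sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1}) = \gamma^{T}\Phi(s_T) - \Phi(s_0).$$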