Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning

Aditya Kapoor, Sushant Swamy, Kale-ab Tessera, Mayank Baranwal, Mingfei Sun, Harshad Khadilkar, Stefano V. Albrecht

TL;DR

The paper tackles sparse, delayed rewards in cooperative multi-agent reinforcement learning (MARL) by introducing Temporal-Agent Reward Redistribution (TAR$^2$), which densifies the episodic global return across time steps and agents. TAR$^2$ learns per-time-step, per-agent credits that define a redistributed reward function $\boldsymbol{\mathcal{R}}_{\omega,\kappa}$. This function is shown to be equivalent to potential-based reward shaping, so the optimal policy is preserved. The paper further establishes that policy-gradient updates under redistribution have the same direction as the original updates, so learning trajectories remain consistent, and it provides architectural details for learning the temporal and agent weights. Empirically, TAR$^2$ yields faster convergence and improved stability on SMACLite benchmarks, and single-agent RL algorithms combined with TAR$^2$ perform as well as or better than traditional MARL methods. The work offers a practical, policy-preserving approach to credit assignment that can be integrated with existing MARL and single-agent methods, with promising future directions in attention-guided weight inference and transfer learning.

Abstract

In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long-horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem by redistributing sparse rewards both temporally and across agents. TAR$^2$ decomposes sparse global rewards into time-step-specific rewards and calculates agent-specific contributions to these rewards. We theoretically prove that TAR$^2$ is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirical results demonstrate that TAR$^2$ stabilizes and accelerates the learning process. Additionally, we show that when TAR$^2$ is integrated with single-agent reinforcement learning algorithms, it performs as well as or better than traditional multi-agent reinforcement learning methods.
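
To make the two-level decomposition concrete, here is a minimal sketch; it is not the authors' implementation. It assumes (the abstract does not state this) that the temporal weights are normalized over time steps and the agent weights are normalized over agents at each step, here via a softmax over hypothetical raw scores. The names `redistribute`, `temporal_scores`, and `agent_scores` are illustrative only. Under that assumption, the redistributed per-agent rewards sum back to the episodic global return.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def redistribute(episodic_return, temporal_scores, agent_scores):
    """Split one episodic return into per-step, per-agent rewards.

    temporal_scores: one raw score per time step (hypothetical model output).
    agent_scores: for each time step, one raw score per agent (hypothetical).
    Returns rewards[t][i] = agent_weight[t][i] * temporal_weight[t] * return.
    """
    temporal_w = softmax(temporal_scores)  # assumed to sum to 1 over time
    rewards = []
    for w_t, scores_t in zip(temporal_w, agent_scores):
        agent_w = softmax(scores_t)  # assumed to sum to 1 over agents
        rewards.append([a * w_t * episodic_return for a in agent_w])
    return rewards


# Toy episode: 3 time steps, 2 agents, one sparse return at the end.
rewards = redistribute(
    episodic_return=10.0,
    temporal_scores=[0.0, 1.0, 2.0],
    agent_scores=[[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]],
)
total = sum(sum(step) for step in rewards)  # equals the episodic return
```

In this toy episode, later time steps receive a larger share of the return because their temporal scores are higher, and within each step the agent with the higher score receives the larger share.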

Paper Structure

This paper contains 20 sections, 3 theorems, 19 equations, 1 figure.

Key Result

Theorem 1

Given an $n$-player discounted stochastic game $M = (S, A_1, \ldots, A_n, T, \gamma, R_1, \ldots, R_n)$, we define a transformed $n$-player discounted stochastic game $M' = (S, A_1, \ldots, A_n, T, \gamma, R_1 + F_1, \ldots, R_n + F_n)$, where $F_i \in S \times S$ is a shaping reward function for pl… where $\Phi_i: S \to \mathbb{R}$ is a potential function. Then, the potential-based shaping functio…
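
The theorem statement above is cut off, so the following is only a sketch of the standard form of potential-based shaping, $F(s, s') = \gamma \Phi(s') - \Phi(s)$, which is assumed here rather than taken from the truncated text. The potential function, states, and rewards are arbitrary illustrative values. The sketch checks numerically that the shaped discounted return differs from the original only by a term that depends on the first and last states of the trajectory.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def shaped_rewards(states, rewards, potential, gamma):
    """Add F(s, s') = gamma * potential(s') - potential(s) to each reward.

    states has one more entry than rewards: step t moves
    from states[t] to states[t + 1].
    """
    return [
        r + gamma * potential(states[t + 1]) - potential(states[t])
        for t, r in enumerate(rewards)
    ]


def potential(s):
    """Arbitrary illustrative potential (an assumption, not from the paper)."""
    return float(s * s)


gamma = 0.9
states = [0, 1, 3, 2, 5]
rewards = [0.0, 0.0, 0.0, 1.0]  # sparse reward at the last step

original = discounted_return(rewards, gamma)
shaped = discounted_return(
    shaped_rewards(states, rewards, potential, gamma), gamma
)

# Telescoping sum: the shaping terms collapse to the two endpoints.
T = len(rewards)
offset = (gamma ** T) * potential(states[-1]) - potential(states[0])
```

Because the intermediate potentials cancel, two trajectories with the same start and end states receive the same offset, which is the intuition behind shaping of this form leaving the ranking of policies unchanged.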

Figures (1)

  • Figure 1: Average agent episodic rewards with standard deviation for task 5m_vs_6m.

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • proof
  • Proposition 1
  • proof