Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning

Aditya Kapoor; Sushant Swamy; Kale-ab Tessera; Mayank Baranwal; Mingfei Sun; Harshad Khadilkar; Stefano V. Albrecht

TL;DR

The paper tackles sparse, delayed rewards in cooperative multi-agent reinforcement learning (MARL) by introducing Temporal-Agent Reward Redistribution (TAR$^2$), which densifies the episodic global return across time steps and agents. TAR$^2$ learns per-time-step, per-agent credits, yielding a redistributed reward function $\boldsymbol{\mathcal{R}}_{\omega,\kappa}$ that is equivalent to potential-based reward shaping and therefore preserves the optimal policy. The paper further shows that policy-gradient updates under the redistributed reward have the same direction as the original updates, so learning stays consistent with the original objective, and it describes the architecture used to learn the temporal and agent weights. Empirically, TAR$^2$ converges faster and more stably on SMACLite benchmarks, and when integrated with single-agent RL algorithms it performs as well as or better than traditional MARL methods. The result is a practical, policy-preserving approach to credit assignment that can be combined with existing MARL and single-agent methods; the future directions named are attention-guided weight inference and transfer learning.
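For orientation, potential-based reward shaping in its standard single-agent form adds the discounted difference of a state potential to the original reward. The symbols $\Phi$ (potential), $\gamma$ (discount factor), and $R'$ (shaped reward) below are generic notation, not taken from this summary; which potential TAR$^2$ corresponds to is not stated here.

```latex
% Standard potential-based shaping: the added term telescopes along any
% trajectory, so the ranking of policies by return is unchanged.
R'(s, a, s') \;=\; R(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s)
```

Because the shaping term depends only on the potentials of the current and next state, its discounted sum along a trajectory depends only on the start state (and the terminal potential), which is why the optimal policy is preserved.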

Abstract

In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long-horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem by redistributing sparse rewards both temporally and across agents. TAR$^2$ decomposes sparse global rewards into time-step-specific rewards and calculates agent-specific contributions to these rewards. We theoretically prove that TAR$^2$ is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirical results demonstrate that TAR$^2$ stabilizes and accelerates the learning process. Additionally, we show that when TAR$^2$ is integrated with single-agent reinforcement learning algorithms, it performs as well as or better than traditional multi-agent reinforcement learning methods.
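The two-level redistribution described above can be illustrated with a minimal NumPy sketch. It assumes that the temporal weights are normalized over time steps and the agent weights over agents at each step (here via softmax), so that the redistributed rewards sum back to the episodic return. Reading $\omega$ as the temporal weights and $\kappa$ as the agent weights is an assumption based on the subscripts of $\boldsymbol{\mathcal{R}}_{\omega,\kappa}$; the function name `redistribute_return` and the random logits are hypothetical stand-ins for the learned model, which is not shown here.

```python
import numpy as np


def redistribute_return(episodic_return, temporal_logits, agent_logits):
    """Split one episodic return into per-step, per-agent rewards.

    temporal_logits: shape (T,)   -> softmax over time gives omega_t
    agent_logits:    shape (T, N) -> softmax over agents gives kappa_{t,i}
    Both normalizations are assumptions made for this sketch.
    """
    temporal_logits = np.asarray(temporal_logits, dtype=float)
    agent_logits = np.asarray(agent_logits, dtype=float)

    # Temporal weights: non-negative, sum to 1 across the episode.
    omega = np.exp(temporal_logits - temporal_logits.max())
    omega /= omega.sum()

    # Agent weights: non-negative, sum to 1 across agents at each step.
    kappa = np.exp(agent_logits - agent_logits.max(axis=1, keepdims=True))
    kappa /= kappa.sum(axis=1, keepdims=True)

    # r[t, i] = omega_t * kappa_{t,i} * episodic_return
    return episodic_return * omega[:, None] * kappa


# Hypothetical episode: 5 time steps, 3 agents, sparse return of 10.
rng = np.random.default_rng(0)
T, N = 5, 3
rewards = redistribute_return(10.0, rng.normal(size=T), rng.normal(size=(T, N)))
total = rewards.sum()  # equals the episodic return up to floating-point error
```

Each agent then receives a dense reward `rewards[t, i]` at every step instead of a single shared signal at the end of the episode, while the total reward handed out over the episode is unchanged.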

