Difference Rewards Policy Gradients

Jacopo Castellini, Sam Devlin, Frans A. Oliehoek, Rahul Savani

TL;DR

The paper tackles multi-agent credit assignment under centralized training with decentralized execution by introducing Dr.Reinforce, a policy-gradient method that leverages difference rewards to credit individual agents for the shared outcome. When the reward function is unknown, Dr.ReinforceR learns a centralized reward network to estimate difference rewards, avoiding the instability of learning a $Q$-function as in COMA. Through theoretical results and extensive experiments on gridworlds and StarCraft II, the authors show that difference-reward-based gradients can accelerate learning and scale better with more agents, with reward learning offering robust performance where the true reward is inaccessible. The work highlights a practical, scalable alternative to value-based critics for multi-agent cooperation and emphasizes the potential of reward learning to improve policy signals in complex, partially observable domains.

Abstract

Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function as done by Counterfactual Multiagent Policy Gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
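To make "differencing the reward function directly" concrete, here is a minimal sketch for a known reward function. It assumes the counterfactual baseline is the expected reward when one agent's action is resampled from its own policy while the other agents' actions stay fixed; that form, along with the names `difference_reward` and `toy_reward` and the two-agent toy reward, is an illustrative assumption rather than the paper's code.

```python
from typing import Callable, Sequence, Tuple

JointAction = Tuple[int, ...]


def difference_reward(
    reward_fn: Callable[[object, JointAction], float],
    state: object,
    joint_action: JointAction,
    agent: int,
    policy_probs: Sequence[float],
) -> float:
    """Shared reward minus a counterfactual baseline for one agent.

    Assumed baseline: the expected reward when `agent`'s action is drawn
    from `policy_probs` and all other agents' actions are kept fixed.
    """
    baseline = 0.0
    for alt_action, prob in enumerate(policy_probs):
        counterfactual = list(joint_action)
        counterfactual[agent] = alt_action  # swap only this agent's action
        baseline += prob * reward_fn(state, tuple(counterfactual))
    return reward_fn(state, joint_action) - baseline


def toy_reward(state: object, joint_action: JointAction) -> float:
    # Hypothetical shared reward: 1 only if every agent picks action 1.
    return 1.0 if all(a == 1 for a in joint_action) else 0.0


uniform = [0.5, 0.5]
# Both agents cooperate: agent 0 gets positive credit.
print(difference_reward(toy_reward, None, (1, 1), 0, uniform))  # 0.5
# Agent 1 defects: agent 0 could not have changed the outcome ...
print(difference_reward(toy_reward, None, (1, 0), 0, uniform))  # 0.0
# ... while agent 1 is assigned the blame.
print(difference_reward(toy_reward, None, (1, 0), 1, uniform))  # -0.5
```

In the last two calls the shared reward is 0 for both agents, yet only the agent whose action mattered receives a non-zero signal. This is the credit-assignment effect the abstract describes.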

Paper Structure

This paper contains 25 sections, 3 theorems, 32 equations, 13 figures, and 2 tables.

Key Result

Lemma 1

In an MMDP, using the difference return $\Delta G^i_t(a_{t:T}^i\vert s_{t:T},a_{t:T}^{-i})$ as the learning signal for policy gradients (Equation eq:drpg) is equivalent to subtracting an unbiased baseline $B^i(s_{t:T},a^{-i}_{t:T})$ from the distributed policy gradients (Equation eq:mapg).
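The intuition behind "unbiased baseline" can be illustrated with the standard one-step score-function identity. This is only a sketch, not the paper's proof, which covers the full multi-step return. The symbols $\pi_{\theta^i}$ (agent $i$'s policy) and $b$ (any term that does not depend on agent $i$'s own action $a^i_t$) are notation assumed here for illustration, and the identity relies on the other agents' actions being independent of $a^i_t$ given $s_t$:

$$
\mathbb{E}_{a^i_t \sim \pi_{\theta^i}(\cdot \mid s_t)}\!\left[\nabla_{\theta^i} \log \pi_{\theta^i}(a^i_t \mid s_t)\, b(s_t, a^{-i}_t)\right]
= b(s_t, a^{-i}_t) \sum_{a^i_t} \nabla_{\theta^i} \pi_{\theta^i}(a^i_t \mid s_t)
= b(s_t, a^{-i}_t)\, \nabla_{\theta^i} 1
= 0 .
$$

Subtracting such a term therefore leaves the expected gradient unchanged while altering the per-sample learning signal, which is the sense in which a baseline is called unbiased.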

Figures (13)

  • Figure 1: Schematic representation of the two gridworld domains. Agents are green, landmarks are yellow, and the prey is red.
  • Figure 2: Training curves on the multi-rover domain (left) and the predator-prey problem (right), showing the median reward and the 25th-75th percentiles across seeds.
  • Figure 3: Normalized mean prediction error and standard deviation for Dr.ReinforceR reward network $R_{\psi}$ and COMA critic $Q_{\omega}$ on the on-policy dataset (first row) and the off-policy dataset (second row), for the two environments.
  • Figure 4: Mean and variance of difference rewards for a set of samples under different noise profiles.
  • Figure 5: Training curves on the entire set of SMAC maps, showing the median return and the 25th-75th percentiles across seeds.
  • ...and 8 more figures

Theorems & Definitions (7)

  • Lemma 1
  • proof
  • Corollary 1
  • proof
  • Theorem 1
  • proof
  • proof