Reciprocal Reward Influence Encourages Cooperation From Self-Interested Agents

John L. Zhou; Weizhe Hong; Jonathan C. Kao

TL;DR

The paper addresses cooperation among self-interested agents in sequential social dilemmas by introducing Reciprocators: RL agents that are intrinsically rewarded for reciprocating the influence others have on their returns. The core idea combines a counterfactual value-influence measure with an influence-balance tracker to generate a reciprocal reward, which guides opponents toward mutually beneficial actions without differentiating through their policies or optimizing over a meta-game. The approach yields state-of-the-art cooperative outcomes in IPD and Coins during simultaneous learning and resists higher-order exploitation, while relying only on first-order RL and standard training procedures. It offers a sample-efficient, learning-rule-agnostic route to cooperation in mixed-motive multi-agent environments and discusses considerations for practical deployment and future extensions.

Abstract

Cooperation between self-interested individuals is a widespread phenomenon in the natural world, but remains elusive in interactions between artificially intelligent agents. Instead, naive reinforcement learning algorithms typically converge to Pareto-dominated outcomes in even the simplest of social dilemmas. An emerging literature on opponent shaping has demonstrated the ability to reach prosocial outcomes by influencing the learning of other agents. However, such methods differentiate through the learning step of other agents or optimize for meta-game dynamics, which rely on privileged access to opponents' learning algorithms or exponential sample complexity, respectively. To provide a learning rule-agnostic and sample-efficient alternative, we introduce Reciprocators, reinforcement learning agents which are intrinsically motivated to reciprocate the influence of opponents' actions on their returns. This approach seeks to modify other agents' $Q$-values by increasing their return following beneficial actions (with respect to the Reciprocator) and decreasing it after detrimental actions, guiding them towards mutually beneficial actions without directly differentiating through a model of their policy. We show that Reciprocators can be used to promote cooperation in temporally extended social dilemmas during simultaneous learning. Our code is available at https://github.com/johnlyzhou/reciprocator/.
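The mechanism described above (a counterfactual influence measure, a running influence balance, and a reward that multiplies the two) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the accumulation rule in `InfluenceBalance.update`, and the product form of `reciprocal_reward` are assumptions chosen to match the verbal description, and the released code should be consulted for the actual definitions.

```python
import numpy as np


def value_influence(q_values, probs, action):
    """Counterfactual 1-step influence of `action` on another agent's value.

    q_values[a]: the other agent's estimated return if we take action a
    probs[a]:    probability we would have taken action a instead
    (Both arrays are hypothetical inputs for this sketch.)
    """
    baseline = float(np.dot(probs, q_values))  # expectation over our alternatives
    return float(q_values[action]) - baseline


class InfluenceBalance:
    """Running tally of influence owed to an opponent (assumed update rule)."""

    def __init__(self):
        self.balance = 0.0

    def update(self, influence_received, influence_exerted):
        # Credit what the opponent did to our returns,
        # debit what we have already returned to theirs.
        self.balance += influence_received - influence_exerted
        return self.balance


def reciprocal_reward(balance, influence_on_opponent):
    # Positive when our influence has the same sign as the balance:
    # help those who helped us, penalize those who harmed us.
    return balance * influence_on_opponent
```

Under these assumptions, an opponent that has lowered the agent's return leaves a negative balance, so the agent is intrinsically rewarded for actions that lower the opponent's value and penalized for actions that raise it, until the balance returns toward zero.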

Paper Structure (22 sections, 6 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Cooperation through Influence
Opponent Shaping
Preliminaries
Reciprocal Reward Influence
1-Step Value Influence
Keeping Score with Influence Balances
Intrinsic Reciprocal Reward
Policy Optimization with Reciprocal Rewards
Experiments
Sequential Social Dilemmas
Baselines
Implementation Details
Results
...and 7 more sections

Figures (5)

Figure 1: (a) The first number in each cell denotes the reward received by the agent taking the row action, and the second the reward received by the agent taking the column action, where C: cooperate (stay silent) and D: defect (confess). (b) Two agents (red and blue) are tasked with collecting randomly spawning coins. If an agent collects its own coin, it receives a reward of +1 (left). If an agent collects another's coin, then it receives a reward of +1 but the other agent receives a punishment of -2.
Figure 2: Representative run of a Reciprocator vs. an NL in IPD-Rollout. Average reciprocal reward per step (left axis) and probability of cooperation (right axis) over the course of an episode.
Figure 3: Shaping an NL in Coins. Proportion of own coins collected by NL during training when facing each opponent (left) and coin counts by type for Reciprocator vs. NL (right). Reciprocator and NL-PPO results are plotted on a scale of single episodes (bottom axis) whereas MFOS results are plotted on a scale of meta-episodes, where one meta-episode contains 16 episodes (top axis).
Figure 4: Head-to-head results in symmetric Coins (two agents of the same kind). Total (extrinsic) reward per episode (left), proportion of own coins collected (right). Again, Reciprocator and NL-PPO results are plotted on a scale of episodes and MFOS results are plotted on a scale of meta-episodes.
Figure 5: Total number of coins per 32 steps collected by NL-PPO (right) vs. each baseline in Coins.
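The Coins reward rule stated in the Figure 1(b) caption can be written out as a short sketch. The function name `coin_rewards` and the integer agent indices are hypothetical; only the +1 and -2 values come from the caption.

```python
def coin_rewards(collector, coin_owner, n_agents=2):
    """Per-agent rewards for one coin pickup, following the Figure 1(b) caption.

    collector:  index of the agent that picked up the coin
    coin_owner: index of the agent whose color the coin matches
    """
    rewards = [0.0] * n_agents
    rewards[collector] += 1.0  # any pickup pays the collector +1
    if coin_owner != collector:
        rewards[coin_owner] -= 2.0  # taking another's coin costs its owner 2
    return rewards
```

Collecting another agent's coin yields a joint reward of -1 (the collector gains 1 while the owner loses 2), which is what makes indiscriminate collection individually tempting but collectively harmful.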
