Reciprocal Reward Influence Encourages Cooperation From Self-Interested Agents
John L. Zhou, Weizhe Hong, Jonathan C. Kao
TL;DR
The paper addresses cooperation among self-interested agents in sequential social dilemmas by introducing Reciprocators, RL agents that are intrinsically rewarded for reciprocating the influence others have on their returns. The core idea combines a counterfactual value-influence measure with an influence-balance tracker to generate a reciprocal reward that guides opponents toward mutually beneficial actions, without differentiating through their policies or requiring meta-game optimization. The approach yields state-of-the-art cooperative outcomes in the iterated prisoner's dilemma (IPD) and Coins during simultaneous learning and is resilient to higher-order exploitation, while relying only on first-order RL and standard training procedures. The work offers a sample-efficient, learning-rule-agnostic route to cooperative AI in mixed-motive multi-agent environments, and it notes considerations for practical deployment and future extensions.
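To make the three ingredients named above concrete, here is a minimal Python sketch of a counterfactual value-influence measure, an influence-balance tracker, and the reciprocal reward formed from them. It is an illustration under stated assumptions, not the authors' implementation: the payoff table `Q`, the uniform counterfactual baseline, the decay factor, and the function and class names are all hypothetical, and the reward is assumed to be the product of the current balance and the influence exerted.

```python
import numpy as np

# Assumed estimated returns for a symmetric two-action dilemma.
# Q[own_action, other_action]; 0 = cooperate, 1 = defect.
Q = np.array([[3.0, 0.0],
              [5.0, 1.0]])


def value_influence(q_values, own_action, other_action, other_policy):
    """Counterfactual influence of `other_action` on the owner of `q_values`.

    The baseline averages the owner's return over the actions the other
    agent could have taken, weighted by `other_policy` (an assumption).
    """
    baseline = float(np.dot(q_values[own_action], other_policy))
    return float(q_values[own_action, other_action]) - baseline


class InfluenceBalance:
    """Running, decayed tally of the influence received from an opponent."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.balance = 0.0

    def update(self, influence_received):
        self.balance = self.decay * self.balance + influence_received
        return self.balance


def reciprocal_reward(balance, influence_exerted):
    """Positive when the influence exerted has the same sign as the balance."""
    return balance * influence_exerted


uniform = np.array([0.5, 0.5])

# Step 1: the Reciprocator cooperates (0) and the opponent defects (1).
tracker = InfluenceBalance(decay=0.9)
balance = tracker.update(value_influence(Q, 0, 1, uniform))

# Step 2: the opponent defects again; score both possible replies.
# By symmetry the opponent's returns are also given by Q.
reward_defect = reciprocal_reward(balance, value_influence(Q, 1, 1, uniform))
reward_cooperate = reciprocal_reward(balance, value_influence(Q, 1, 0, uniform))
print(balance, reward_defect, reward_cooperate)
```

In this toy setting the balance turns negative after the opponent's defection, so the intrinsic reward favors lowering the opponent's return (defecting back) over raising it. A positive balance would reverse the preference.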
Abstract
Cooperation between self-interested individuals is a widespread phenomenon in the natural world, but remains elusive in interactions between artificially intelligent agents. Instead, naive reinforcement learning algorithms typically converge to Pareto-dominated outcomes in even the simplest of social dilemmas. An emerging literature on opponent shaping has demonstrated the ability to reach prosocial outcomes by influencing the learning of other agents. However, such methods either differentiate through the learning step of other agents or optimize for meta-game dynamics, which require privileged access to opponents' learning algorithms or incur exponential sample complexity, respectively. To provide a learning-rule-agnostic and sample-efficient alternative, we introduce Reciprocators, reinforcement learning agents that are intrinsically motivated to reciprocate the influence of opponents' actions on their returns. This approach seeks to modify other agents' $Q$-values by increasing their return following beneficial actions (with respect to the Reciprocator) and decreasing it after detrimental actions, guiding them towards mutually beneficial actions without directly differentiating through a model of their policy. We show that Reciprocators can be used to promote cooperation in temporally extended social dilemmas during simultaneous learning. Our code is available at https://github.com/johnlyzhou/reciprocator/.
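The abstract's point that naive learners end up at Pareto-dominated outcomes can be seen in a single-round prisoner's dilemma. The sketch below assumes the common textbook payoffs (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for unilateral defection); these numbers and the helper names are illustrative and are not taken from the paper, whose experiments use iterated and temporally extended games.

```python
# Assumed stage-game payoffs: (row player, column player).
# "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def best_response(opponent_action):
    """Row player's payoff-maximizing action against a fixed opponent action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])


def pareto_dominated(outcome):
    """True if some other outcome is at least as good for both players
    and has a different payoff pair."""
    payoff = PAYOFFS[outcome]
    return any(
        other != payoff and all(o >= p for o, p in zip(other, payoff))
        for other in PAYOFFS.values()
    )


for opponent_action in ACTIONS:
    print(opponent_action, "->", best_response(opponent_action))
print("(D, D) Pareto-dominated:", pareto_dominated(("D", "D")))
```

Defection is the best response to either opponent action, so independent payoff maximizers settle on mutual defection even though mutual cooperation is better for both.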
