Maintaining cooperation in complex social dilemmas using deep reinforcement learning

Adam Lerer, Alexander Peysakhovich

TL;DR

The paper tackles sustaining cooperation in two-player Markov social dilemmas by introducing Approximate Markov Tit-for-Tat (amTFT), a method that learns cooperative and punitive policies via modified self-play and switches between them within a single interaction based on a per-step debit derived from value estimates. A $\pi^D$-dominance framework and an analytic switching rule underpin theoretical guarantees that amTFT can enforce cooperation against defectors under suitable conditions. Empirical results in Coins and Pong Dilemma show that amTFT achieves near-cooperative outcomes with itself, resists exploitation, and is more robust than Grim, with the added benefit of teaching learners to cooperate. The work demonstrates that simple, interpretable mechanisms integrated with deep RL can scale to high-dimensional settings while preserving cooperative behavior, and it discusses future directions for focal points, human-AI interaction, and theory-guided cooperation.
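
The switching idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `DebitSwitcher`, the fixed `threshold`, the fixed `punish_steps`, and the callables `q_partner` and `coop_partner_action` are all hypothetical stand-ins. The sketch assumes the per-step debit is the partner's estimated gain from deviating from its cooperative action, accumulated until it crosses a threshold.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable


@dataclass
class DebitSwitcher:
    """Toy switcher between a cooperative mode "C" and a punitive mode "D".

    All fields are illustrative assumptions, not taken from the paper's code.
    """
    # Estimate of the partner's value for taking `action` in `state`,
    # assuming cooperative play afterwards (hypothetical signature).
    q_partner: Callable[[State, Action], float]
    # The action the partner's cooperative policy would take in `state`.
    coop_partner_action: Callable[[State], Action]
    threshold: float      # debit level that triggers punishment (assumed fixed)
    punish_steps: int     # length of the punishment phase (assumed fixed)
    debit: float = 0.0
    punish_left: int = 0

    def mode(self) -> str:
        return "D" if self.punish_left > 0 else "C"

    def observe(self, state: State, partner_action: Action) -> str:
        """Update on the partner's observed action; return the next mode."""
        if self.punish_left > 0:
            # Still punishing: count down and ignore the partner's action.
            self.punish_left -= 1
            return self.mode()
        # Per-step debit: partner's estimated gain over its cooperative action.
        gain = (self.q_partner(state, partner_action)
                - self.q_partner(state, self.coop_partner_action(state)))
        self.debit += gain
        if self.debit > self.threshold:
            # Start punishing and reset the debit (forgiveness afterwards).
            self.punish_left = self.punish_steps
            self.debit = 0.0
        return self.mode()
```

In this toy version the agent begins in mode "C" (nice), switches to "D" once the accumulated debit exceeds the threshold (provokable), and returns to "C" after a fixed number of steps (forgiving). How the threshold and punishment length are actually chosen is not specified in this summary, so both are left as free parameters.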

Abstract

Social dilemmas are situations where individuals face a temptation to increase their payoffs at a cost to total welfare. Building artificially intelligent agents that achieve good outcomes in these situations is important because many real-world interactions include a tension between selfish interests and the welfare of others. We show how to modify modern reinforcement learning methods to construct agents that act in ways that are simple to understand, nice (begin by cooperating), provokable (try to avoid being exploited), and forgiving (try to return to mutual cooperation). We show both theoretically and experimentally that such agents can maintain cooperation in Markov social dilemmas. Our construction does not require training methods beyond a modification of self-play. Thus, if an environment is such that good strategies can be constructed in the zero-sum case (e.g., Atari), then we can construct agents that solve social dilemmas in this environment.

Paper Structure

This paper contains 16 sections, 1 theorem, 14 equations, 5 figures, 1 algorithm.

Key Result

Theorem 1

Define $d^{*} = \max_{\mathcal{A}_2, s} (Q^2_{CC}(s, \pi^C_1(s), a) - Q^2_{CC}(s, \pi_1^C(s), \pi_2^C(s))).$ If for any state $s$ we have that $V_2 (s, \pi_1^C, \pi^C_2) - V_2 (s, \pi_1^D, \pi_2^D) > \frac{d^{*}}{\delta}$ then if player $1$ is an amTFT agent, a fully omniscient player $2$ maximizes
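
The hypothesis of the theorem (the definition of $d^{*}$ and the inequality on the value gap) can be checked numerically on a small tabular example. The Python sketch below is illustrative only. It assumes player 1's action is already fixed to $\pi_1^C(s)$, so that `q_cc[s][a]` stands for $Q^2_{CC}(s, \pi^C_1(s), a)$, and it treats $\delta$ as a given positive parameter, since this summary does not define it. The function names and the toy numbers are hypothetical.

```python
from typing import Sequence


def max_deviation_gain(q_cc: Sequence[Sequence[float]],
                       coop_action: Sequence[int]) -> float:
    """d*: largest gain over states and actions from deviating from pi_2^C.

    q_cc[s][a] is player 2's Q-value in state s for action a, with player 1
    playing its cooperative action (tabular toy representation, assumed).
    """
    return max(max(row) - row[coop_action[s]] for s, row in enumerate(q_cc))


def condition_holds(q_cc: Sequence[Sequence[float]],
                    coop_action: Sequence[int],
                    v_cc: Sequence[float],
                    v_dd: Sequence[float],
                    delta: float) -> bool:
    """Check V_2(s, pi^C) - V_2(s, pi^D) > d* / delta for every state s."""
    d_star = max_deviation_gain(q_cc, coop_action)
    return all(vc - vd > d_star / delta for vc, vd in zip(v_cc, v_dd))


# Toy numbers (made up for illustration): two states, two actions.
q_cc = [[5.0, 6.0], [4.0, 4.5]]   # action 0 is the cooperative action
coop_action = [0, 0]
v_cc = [5.0, 4.0]                 # player 2's value under mutual cooperation
v_dd = [2.0, 1.5]                 # player 2's value under mutual defection

d_star = max_deviation_gain(q_cc, coop_action)
ok = condition_holds(q_cc, coop_action, v_cc, v_dd, delta=0.5)
```

With these toy numbers the largest one-step deviation gain is 1.0 and the smallest value gap is 2.5, so the inequality holds for `delta=0.5` (threshold 2.0) and would fail for a smaller `delta` such as 0.35.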

Figures (5)

  • Figure 1: In two Markov social dilemmas we find that standard self-play converges to defecting strategies while modified self-play finds cooperative, but exploitable strategies. We use the results of these two training schedules to construct $\hat{\pi}^C$ and $\hat{\pi}^D$.
  • Figure 2: In two Markov social dilemmas, amTFT satisfies the Axelrod desiderata: it mostly cooperates with itself, is robust against defectors, and incentivizes cooperation from its partner. The 'Grim' strategy based on de Cote and Littman (2008) behaves almost identically to pure defection in these social dilemmas. The result of standard self-play is $\pi^D$. The full tournament of all strategies against each other is shown in the Appendix.
  • Figure 3: Both purely selfish and purely cooperative Teachers lead Learners to exploitative strategies. However, amTFT Teachers lead Learners to cooperate and thus both agents reach a higher payoff in the long-run.
  • Figure 4: Results from training one-memory strategies using policy gradient in the repeated Prisoner's Dilemma. Even in extremely favorable conditions, self-play fails to discover cooperation-maintaining strategies. Note that with a temptation payoff of $0.5$ the game is not a PD; here $C$ is a dominant strategy in the stage game.
  • Figure 5: Results of the tournament in two Markov social dilemmas. Each cell contains the average total reward of the row strategy against the column strategy. amTFT achieves close to cooperative payoffs with itself and achieves close to the defect payoff against defectors. Its partner also receives a higher payoff for cooperation than defection.

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 1