Enhancing Cooperation through Selective Interaction and Long-term Experiences in Multi-Agent Reinforcement Learning

Tianyu Ren, Xiao-Jun Zeng

TL;DR

This paper studies how cooperation emerges in social dilemmas by letting agents learn both dilemma strategies and neighbour selection in a spatial Prisoner’s Dilemma using multi‑agent reinforcement learning (MARL). It introduces a dual Q‑network MARL framework that uses long‑term experience to distinguish cooperative from non‑cooperative neighbours, which promotes network reciprocity and the clustering of strategies. Empirical results show higher cooperation levels and payoffs than evolutionary game theory (EGT) baselines, with full cooperation maintained up to a dilemma strength of $b\approx 1.2$ and cooperation improving with longer memory. The work provides a scalable framework with an explicit interaction network for studying the coevolution of cooperation and interaction, with implications for designing cooperative artificial and human‑AI systems.
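
The TL;DR reports results in terms of a dilemma strength $b$ and compares against an EGT baseline. The following is a minimal sketch, not the authors' code. It assumes the common weak Prisoner's Dilemma parameterisation ($R=1$, $T=b$, $S=P=0$). It also assumes the baseline's "social learning" is a Fermi imitation rule. Neither assumption is confirmed by this summary, and the function names and the noise parameter `k` are hypothetical.

```python
import math
import random

COOPERATE, DEFECT = 0, 1

def pd_payoff(my_action: int, other_action: int, b: float) -> float:
    """Focal player's payoff in a weak Prisoner's Dilemma.

    Assumed parameterisation (not taken from the paper): R = 1, T = b,
    S = P = 0, so b > 1 measures the temptation to defect.
    """
    if my_action == COOPERATE:
        return 1.0 if other_action == COOPERATE else 0.0  # R or S
    return b if other_action == COOPERATE else 0.0        # T or P

def fermi_adopt_probability(my_payoff: float, other_payoff: float, k: float = 0.1) -> float:
    """Probability of copying a neighbour under the Fermi imitation rule.

    Assumed stand-in for the EGT baseline's social learning; k is selection noise.
    """
    x = (my_payoff - other_payoff) / k
    x = max(-700.0, min(700.0, x))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(x))

def imitation_step(my_action: int, my_payoff: float,
                   other_action: int, other_payoff: float,
                   rng: random.Random, k: float = 0.1) -> int:
    """Hypothetical baseline update: keep own action or copy the neighbour's."""
    if rng.random() < fermi_adopt_probability(my_payoff, other_payoff, k):
        return other_action
    return my_action

# Example: with b = 1.2, defecting against a cooperator pays 1.2 vs 1.0 for mutual cooperation.
rng = random.Random(42)
print(pd_payoff(DEFECT, COOPERATE, 1.2), pd_payoff(COOPERATE, COOPERATE, 1.2))
print(imitation_step(COOPERATE, 3.0, DEFECT, 3.6, rng))
```

Under this parameterisation, raising $b$ only increases the temptation payoff $T$. As a result, a defector next to cooperators gains more than a cooperator does, and an imitating cooperator is more likely to switch.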

Abstract

The significance of network structures in promoting group cooperation within social dilemmas has been widely recognized. Prior studies attribute this facilitation to the assortment of strategies driven by spatial interactions. Although reinforcement learning has been employed to investigate the impact of dynamic interaction on the evolution of cooperation, it remains poorly understood how agents develop neighbour selection behaviours and how strategic assortment forms within an explicit interaction structure. To address this, our study introduces a computational framework based on multi-agent reinforcement learning in the spatial Prisoner's Dilemma game. This framework allows agents to select dilemma strategies and interacting neighbours based on their long-term experiences, unlike existing research, which relies on preset social norms or external incentives. By modelling each agent with two distinct Q-networks, we disentangle the coevolutionary dynamics between cooperation and interaction. The results indicate that long-term experience enables agents to identify non-cooperative neighbours and to prefer interacting with cooperative ones. This emergent self-organizing behaviour leads to the clustering of agents with similar strategies, thereby increasing network reciprocity and enhancing group cooperation.
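
The abstract describes each agent as holding two separate Q-networks. One chooses the dilemma action and the other selects neighbours, and both are conditioned on long-term experience. The sketch below simplifies this by replacing each network with a Q-table; it is not the authors' architecture. The state is a hashable tuple of recent actions, and its length plays the role of memory. The class and method names, epsilon-greedy exploration, the one-step Q-learning update and the example rewards are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict, deque

COOPERATE, DEFECT = 0, 1   # dilemma actions
IGNORE, INTERACT = 0, 1    # selection actions

class DualQAgent:
    """Tabular stand-in for an agent with two separate value functions.

    Assumption: the paper uses two Q-networks; here each is a Q-table so the
    idea fits in a few lines. States are hashable tuples of recent actions,
    and `memory` sets how many past steps (long-term experience) they keep.
    """

    def __init__(self, memory=3, alpha=0.1, gamma=0.9, epsilon=0.1, seed=None):
        self.memory = memory
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q_dilemma = defaultdict(lambda: [0.0, 0.0])  # [Q(cooperate), Q(defect)]
        self.q_select = defaultdict(lambda: [0.0, 0.0])   # [Q(ignore), Q(interact)]

    def new_history(self):
        """Rolling window of the last `memory` observations."""
        return deque(maxlen=self.memory)

    def _epsilon_greedy(self, q_values):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(2)
        return 0 if q_values[0] >= q_values[1] else 1  # ties favour action 0

    def choose_dilemma_action(self, state):
        """state: recent dilemma actions of the agent and its neighbours."""
        return self._epsilon_greedy(self.q_dilemma[state])

    def choose_interaction(self, neighbour_state):
        """neighbour_state: that neighbour's recent dilemma actions."""
        return self._epsilon_greedy(self.q_select[neighbour_state])

    def _q_update(self, table, s, a, r, s_next):
        # One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + self.gamma * max(table[s_next])
        table[s][a] += self.alpha * (target - table[s][a])

    def learn_dilemma(self, s, a, r, s_next):
        self._q_update(self.q_dilemma, s, a, r, s_next)

    def learn_selection(self, s, a, r, s_next):
        self._q_update(self.q_select, s, a, r, s_next)

# Usage: the history window keeps only the last `memory` moves.
agent = DualQAgent(memory=3, alpha=0.5, gamma=0.0, epsilon=0.0, seed=0)
history = agent.new_history()
for move in (COOPERATE, COOPERATE, COOPERATE, DEFECT):
    history.append(move)
print(tuple(history))  # (0, 0, 1)

coop_state, defect_state = (COOPERATE,) * 3, (DEFECT,) * 3
for _ in range(20):
    # Hypothetical rewards: linking to a cooperator pays 1, to a defector costs 0.5.
    agent.learn_selection(coop_state, INTERACT, 1.0, coop_state)
    agent.learn_selection(defect_state, INTERACT, -0.5, defect_state)
print(agent.choose_interaction(coop_state), agent.choose_interaction(defect_state))  # 1 0
```

The two tables are kept separate, so the value of cooperating and the value of linking to a particular neighbour are learned independently. This loosely mirrors the separation the abstract describes. A longer `memory` enlarges the state space but lets the selection table tell a consistently defecting neighbour apart from one that defected once.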

Paper Structure (23 sections, 9 equations, 6 figures)

Figures (6)

  • Figure 1: Training framework for developing dilemma and interaction strategies. In each iteration, agent $i$ chooses a dilemma strategy and selects neighbouring agents with which to play the Prisoner's Dilemma game (PDG). Each agent uses two Q-networks. The dilemma policy network processes the long-term history of dilemma actions by the agent and its neighbours. The interaction selection network assesses neighbours' dilemma actions alongside the agent's previous interactions. The agent calculates the utility of its actions based on the rewards accumulated from past encounters.
  • Figure 2: The RL-based training approach (ours) promotes cooperation more effectively than the EGT (baseline) method. In the EGT baseline (orange), agents only accumulate payoffs and update their dilemma actions through social learning. In contrast, the dilemma and selection policies learned with RL (blue) substantially raise the level of cooperation in the population. Our RL-based method maintains full cooperation in the population until the dilemma strength exceeds $b=1.2$.
  • Figure 3: The evolution of cooperation and associated payoffs across varying dilemma strengths. In all scenarios, the fraction of cooperators first decreases and then increases over time. As dilemma strength increases, average individual payoffs fall and inequality rises. The evaluation covers the evolutionary trajectories of (a) the cooperation level and (b) the Gini coefficient, together with (c) average group payoffs and (d)-(e) payoffs for trained cooperators and defectors, with dilemma strength varying from $b=1.20$ to $1.26$.
  • Figure 4: Temporal evolution of strategy connectivity and actual link interactions. RL agents demonstrate enhanced interaction capabilities, increasing their connections with cooperative neighbours. (a) Average connectivity ratio of cooperators and defectors over the first half of the total timesteps. (b) Frequency of actual links between each pair of dilemma strategies during the first ten episodes. The dilemma strength is set to $b=1.20$.
  • Figure 5: Snapshots of the spatial evolution of strategies and their connections. Cooperative individuals resist defector incursions by forming and expanding clusters. Panels (a)-(d) show strategy distributions, and panels (e)-(h) show the corresponding strategy connections at the same timesteps. Each pixel is an agent, coloured blue for cooperators and red for defectors. Colour intensity encodes the strategy connectivity ratio, from 0 (light) to 1 (dark). The results are obtained for $b=1.20$.
  • ...and 1 more figure
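
The captions refer to three measures: a Gini coefficient of payoffs, a per-strategy connectivity ratio, and counts of actual links between strategies. Below is a minimal Python sketch of these measurements. It assumes a square lattice with periodic boundaries and a von Neumann (four-neighbour) neighbourhood. It defines the connectivity ratio as the fraction of an agent's lattice neighbours it actually interacted with. The paper's exact definitions may differ, and all function names are hypothetical.

```python
from collections import Counter

COOPERATE, DEFECT = 0, 1

def gini(values):
    """Gini coefficient of non-negative payoffs: 0 = equal, near 1 = concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def von_neumann_neighbours(idx, size):
    """Up, down, left and right cells on a size x size periodic lattice (assumed topology)."""
    r, c = divmod(idx, size)
    return [((r - 1) % size) * size + c, ((r + 1) % size) * size + c,
            r * size + (c - 1) % size, r * size + (c + 1) % size]

def connectivity_ratio_by_strategy(strategies, links, size):
    """Mean fraction of lattice neighbours each agent is linked to, per strategy.

    Hypothetical definition; `links` is a set of frozenset({i, j}) pairs that
    actually played the game in the current step.
    """
    sums = {COOPERATE: 0.0, DEFECT: 0.0}
    counts = {COOPERATE: 0, DEFECT: 0}
    for i, s in enumerate(strategies):
        nbrs = von_neumann_neighbours(i, size)
        active = sum(frozenset((i, j)) in links for j in nbrs)
        sums[s] += active / len(nbrs)
        counts[s] += 1
    return {s: (sums[s] / counts[s] if counts[s] else 0.0) for s in sums}

def link_type_counts(strategies, links):
    """Number of active links of each type: 'CC', 'CD' or 'DD'."""
    label = {COOPERATE: "C", DEFECT: "D"}
    return Counter("".join(sorted(label[strategies[i]] for i in link)) for link in links)

# Usage on a 3 x 3 lattice with a single defector in the centre (index 4).
strategies = [COOPERATE] * 9
strategies[4] = DEFECT
links = {frozenset({0, 1}), frozenset({0, 3}), frozenset({1, 4})}
print(gini([0, 0, 0, 1]))                                   # 0.75
print(link_type_counts(strategies, links))                   # Counter({'CC': 2, 'CD': 1})
print(connectivity_ratio_by_strategy(strategies, links, 3))  # {0: 0.15625, 1: 0.25}
```

The Gini function uses the sorted-rank form of the coefficient, which avoids the quadratic all-pairs sum. Equal payoffs give 0, and a population where one agent holds everything approaches 1 as it grows.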