Bottom-Up Reputation Promotes Cooperation with Multi-Agent Reinforcement Learning

Tianyu Ren, Xuan Yao, Yang Li, Xiao-Jun Zeng

TL;DR

The paper studies cooperation in multi-agent reinforcement learning when each agent forms reputations privately. It introduces Learning with Reputation Reward (LR2), in which each agent learns a dilemma policy $\pi^i$ for action selection and an evaluation policy $\eta^i$ that assigns reputations to its neighbours, and the assigned reputations reshape the neighbours' rewards. Game payoffs are constrained to $R=1$, $P=0$, $0\le T\le 2$, $-1\le S\le 1$. Evaluations on spatial social dilemmas on a lattice show that LR2 yields higher cooperation levels and emergent strategy clustering, outperforming baselines and ablations. Key insights include LR2's robustness to strong dilemmas, the formation of cooperative clusters, and the benefit of balanced reputation alignment over strict enforcement of consensus.
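The payoff constraints above can be made concrete with a small sketch. This is an illustration only, not the authors' code: the function names `payoff_matrix` and `dilemma_type` are hypothetical, and the four-way classification is the standard one for symmetric two-player games with $R=1$ and $P=0$, which this summary does not spell out.

```python
# Actions are indexed 0 = cooperate, 1 = defect.

def payoff_matrix(T: float, S: float, R: float = 1.0, P: float = 0.0):
    """Return the focal player's payoffs as matrix[my_action][other_action]."""
    if not (0.0 <= T <= 2.0 and -1.0 <= S <= 1.0):
        raise ValueError("expected 0 <= T <= 2 and -1 <= S <= 1")
    return [[R, S],   # I cooperate: reward R or sucker's payoff S
            [T, P]]   # I defect: temptation T or punishment P


def dilemma_type(T: float, S: float, R: float = 1.0, P: float = 0.0) -> str:
    """Classify the game by the ordering of T vs R and S vs P (standard taxonomy)."""
    if T > R and S < P:
        return "prisoner's dilemma"
    if T > R and S > P:
        return "snowdrift"
    if T < R and S < P:
        return "stag hunt"
    if T < R and S > P:
        return "harmony"
    return "boundary case"


if __name__ == "__main__":
    # T and S values taken from the Figure 4 caption.
    print(payoff_matrix(1.33, -0.33))
    print(dilemma_type(1.33, -0.33))
```

Under this taxonomy, the setting $T=1.33$, $S=-0.33$ used in the figure captions falls in the prisoner's dilemma region, where defection is individually tempting regardless of the neighbour's action.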

Abstract

Reputation serves as a powerful mechanism for promoting cooperation in multi-agent systems, as agents are more inclined to cooperate with those of good social standing. While existing multi-agent reinforcement learning methods typically rely on predefined social norms to assign reputations, the question of how a population reaches a consensus on judgement when agents hold private, independent views remains unresolved. In this paper, we propose a novel bottom-up reputation learning method, Learning with Reputation Reward (LR2), designed to promote cooperative behaviour through reward shaping based on assigned reputation. Our agent architecture includes a dilemma policy that determines cooperation by considering the impact on neighbours, and an evaluation policy that assigns reputations to affect the actions of neighbours while optimizing self-objectives. It operates using local observations and interaction-based rewards, without relying on centralized modules or predefined norms. Our findings demonstrate the effectiveness and adaptability of LR2 across various spatial social dilemma scenarios. Interestingly, we find that LR2 stabilizes and enhances cooperation not only with reward reshaping from bottom-up reputation but also by fostering strategy clustering in structured populations, thereby creating environments conducive to sustained cooperation.

Paper Structure

This paper contains 20 sections, 13 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the social dilemma game with reputation and architecture of our LR2 agents. (a) Each agent is connected to four neighbours in a network. Each round consists of two phases: First, agents choose to cooperate or defect based on their reputations and those of their surrounding neighbours. In the second phase, agents receive reputation assignments reflecting how their behaviours are perceived within their neighbours' local group. (b) Agent $i$ updates its dilemma policy $\theta^i$ by considering both environmental rewards and assigned reputation. The evaluation policy $\eta^i$ is then updated based on the rewards accumulated by the updated dilemma policy $\hat{\theta}^i$.
  • Figure 2: Comparison of cooperation levels between LR2 and the D-D baseline across different $T$ and $S$ values. LR2 promotes cooperation more effectively in the PD game. (a) D-D baseline: agents optimize dilemma policies based solely on environmental rewards. (b) LR2 method (ours): agents utilize both dilemma and evaluation policies, with rewards reshaped by reputation. The colour gradient from red to blue represents cooperation levels ranging from $0$ to $1$.
  • Figure 3: The evolution of cooperation with associated rewards and reputations. LR2 agents learn to assess their neighbours' behaviours to reshape rewards, fostering cooperative evolution. The evaluation includes (a) the evolutionary trajectory of cooperation levels, (b) the average reputation of cooperative and defective agents at the end of training, and (c)-(d) the rewards of cooperators and defectors over time. Results are presented with the parameter $T$ varying from $1.30$ to $1.37$, while $S$ is fixed at $-0.33$.
  • Figure 4: Representative snapshots showing the spatial distribution of learned dilemma actions and assigned reputations on a square lattice. The LR2 method fosters the formation of cooperator clusters through spatial effects, thereby enhancing the overall level of cooperation. Panels (a)-(e) display the evolutionary trajectories of two competing dilemma strategies at the timesteps $t= 1$k, $5$k, $10$k, $20$k, and $50$k. The corresponding panels (f)-(j) illustrate the average reputations assigned by neighbours at the same timesteps. Pixels represent agents as cooperators (blue) and defectors (red), with reputation levels ranging from $0$ to $1$. Results are obtained for $T=1.33$ and $S=-0.33$.
  • Figure 5: Ablations on LR2 architecture components. Considering others' reputation evaluations and the effects of assigned reputations most effectively promotes cooperation. (a) Varying the importance of reputation alignment, with colours from dark to light representing $\mu$ from $1$ to $0$. Parameter $S$ is fixed at $-0.33$. (b) Performance of the IPPO training method without reputation reward.
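
The spatial setup described in Figures 1(a) and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: periodic (wrap-around) boundaries are assumed, the helper names `neighbours`, `round_payoffs`, and `cooperation_level` are hypothetical, and only the environmental payoff is computed, since this summary does not give the form of the reputation reward.

```python
C, D = 0, 1  # cooperate, defect


def neighbours(r: int, c: int, n: int):
    """Four lattice neighbours of cell (r, c); wrap-around edges are an assumption."""
    return [((r - 1) % n, c), ((r + 1) % n, c), (r, (c - 1) % n), (r, (c + 1) % n)]


def round_payoffs(actions, T: float, S: float, R: float = 1.0, P: float = 0.0):
    """Environmental payoff of every agent: sum of pairwise games with its four neighbours."""
    n = len(actions)
    table = {(C, C): R, (C, D): S, (D, C): T, (D, D): P}
    return [
        [
            sum(table[(actions[r][c], actions[nr][nc])] for nr, nc in neighbours(r, c, n))
            for c in range(n)
        ]
        for r in range(n)
    ]


def cooperation_level(actions) -> float:
    """Fraction of agents that cooperate, between 0 and 1."""
    flat = [a for row in actions for a in row]
    return flat.count(C) / len(flat)


if __name__ == "__main__":
    # A 3x3 lattice of cooperators with a single defector in the centre.
    grid = [[C, C, C], [C, D, C], [C, C, C]]
    payoffs = round_payoffs(grid, T=1.33, S=-0.33)  # T, S from the Figure 4 caption
    print(payoffs[1][1], payoffs[0][1], payoffs[0][0])
    print(cooperation_level(grid))
```

In this toy configuration the lone defector collects $T$ from each of its four neighbours, while each cooperator next to it receives $S$ from one of its four games. This is the incentive gap that, per the captions above, the reputation-based reward reshaping is meant to offset.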