Evolutionary Multi-agent Reinforcement Learning in Group Social Dilemmas

Brian Mintz, Feng Fu

TL;DR

This work considers the fundamental framework of Q-learning in public goods games, where RL individuals must work together to achieve a common goal. It finds selection for both higher and lower levels of exploration, as well as attracting values, and a condition that separates these outcomes in a restricted class of games.

Abstract

Reinforcement learning (RL) is a powerful machine learning technique that has been successfully applied to a wide variety of problems. However, it can be unpredictable and produce suboptimal results in complicated learning environments. This is especially true when multiple agents learn simultaneously, which creates a complex system that is often analytically intractable. Our work considers the fundamental framework of Q-learning in Public Goods Games, where RL individuals must work together to achieve a common goal. This setting allows us to study the tragedy of the commons and free rider effects in AI cooperation, an emerging field with potential to resolve challenging obstacles to the wider application of artificial intelligence. While this social dilemma has been mainly investigated through traditional and evolutionary game theory, our approach bridges the gap between these two by studying agents with an intermediate level of intelligence. Specifically, we consider the influence of learning parameters on cooperation levels in simulations and a limiting system of differential equations, as well as the effect of evolutionary pressures on exploration rate in both of these models. We find selection for higher and lower levels of exploration, as well as attracting values, and a condition that separates these in a restricted class of games. Our work enhances the theoretical understanding of evolutionary Q-learning, and extends our knowledge of the evolution of machine behavior in social dilemmas.
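
For orientation, the two ingredients named above can be written in their standard textbook forms. This is a sketch of the usual Q-learning update with Boltzmann (softmax) exploration, not necessarily the exact formulation used in the paper. The symbols $\alpha$ (learning rate), $\gamma$ (discount factor), and $T$ (temperature) match the figure captions; $Q_i$, $\pi_i$, $a$, and $R$ are notation assumed here for agent $i$'s value estimate, its policy, an action, and the received reward.

$$Q_i(a) \leftarrow Q_i(a) + \alpha\left[R + \gamma \max_{a'} Q_i(a') - Q_i(a)\right], \qquad \pi_i(a) = \frac{e^{Q_i(a)/T}}{\sum_{a'} e^{Q_i(a')/T}}$$

In this form, a small $T$ makes the policy nearly greedy with respect to $Q_i$, while a large $T$ pushes it toward uniform random choice, which is the sense in which the temperature acts as an exploration rate.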

Paper Structure

This paper contains 5 sections, 7 equations, 5 figures.

Figures (5)

  • Figure 1: Stochastic learning dynamics with symmetric temperature. These plots show the trajectories of strategies in the agent-based simulation over time as dotted lines, with the average strategy in bold, where the rewards are [0, 0, 0, 2, 4, 6], $N=5$, $\gamma = 0$, $\alpha = 0.1$, $r = 0$, and $T=0.5$ in panel (a) and $T=1$ in panel (b). Varying the temperature produces a range of behaviors. For temperatures that are low relative to the learning rate and rewards, agents enter a self-reinforcing cycle where they choose the most beneficial action repeatedly. For large temperatures, the strategies fail to converge. We see good alignment with the prediction of the ODE model that strategies cluster together when the temperatures are the same.
  • Figure 2: The optimal learning parameters vary with the reward function. This plot shows the average strategy in the group, taken over 100 runs, after 500 iterations, where the horizontal axis is the learning rate and the vertical axis is the discount factor, both between zero and one. Further, $r=0$, so there is no replacement; the temperature is $T=0.5$; the population consists of five agents; and the reward function is linear, $f(x) = kx$, with $k = 0.9$ on the left and $k = 1.1$ on the right. In these cases every jump is the constant $k$, so on the left it is always slightly better to defect. Despite this, agents contribute 85% of the time with a higher learning rate and a low discount factor. Similarly, on the right it is always slightly better to contribute, and a wider range of learning rates achieves a similar probability of contributing. Because of the nonzero temperature, it is impossible to achieve perfect cooperation (a strategy of one), and the achieved values are approximately the largest possible given this temperature and the possible rewards for each action.
  • Figure 3: Learning and evolution can have varying effects on contribution levels. By varying the learning rate $T$ and replacement probability $r$, one can tune the relative strength of learning and evolution. This plot shows the average, plus or minus one standard error, over 20 runs of the group level of contribution after 1000 iterations. These simulations are initialized with a single temperature, with no mutation in temperature, so selection acts only on the strategies.
  • Figure 4: The reward function can lead to positive or negative selection. This plot represents the most likely outcome of the evolutionary dynamics in the temperature parameter, starting from $T=0.05$ and up to $T=1$, over the space of possible reward functions for the three-player game, found through the adaptive dynamics approach described in the appendix. Letting $m$ be the maximum reward, attained when all individuals contribute, we can specify the function as $[0, j_0, j_0+j_1, m]$, where $j_0$ and $j_1$ are the jumps in reward when an additional individual contributes, if zero or one other had already contributed. Assuming the reward function is increasing, we have $0\le j_0\le m$ and $0\le j_1 \le m-j_0$, so only the values in the lower triangle are considered. We see there is a clear transition to larger final temperatures when $j_0+j_1$ exceeds a threshold depending on $m$, in this case $m=10$.
  • Figure 5: Null-manifold of symmetric learning dynamics over the space of three-player Public Goods Games. This plot shows the equilibria of the learning dynamics where $j_0=x$, $j_1=y$, and $z$ is the strategy, assuming $j_2 = m-j_0-j_1$. Specifically, this plot uses $m=3$ and $T=0.1$. The red plane delineates the region $j_0+j_1 \le m$, and the green region is the subset where the initial rate of change of the strategy is positive. In this case, the maximum equilibrium contribution level that is reached from an initial strategy of 0.5 occurs when $y$ is on this boundary and $x$ is small. Note that a large range of $x$, from zero to 0.1, has approximately the same level of contribution. Additionally, these values are close to having a negative initial change in the strategy, likely making them unstable for the actual dynamics, and thus possibly resulting in less frequent contribution.
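
The agent-based setup described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes stateless Q-learning with two actions (defect, contribute), Boltzmann action selection, no replacement ($r = 0$), and a contribution cost of 1 (suggested by the Figure 2 remark that defecting is slightly better when every jump is $k = 0.9$, but not stated explicitly). The function names `boltzmann_probs`, `jumps`, and `simulate`, and the `cost` and `seed` parameters, are hypothetical.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; the max is subtracted for numerical stability."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def jumps(rewards):
    """Increase in group reward from each additional contributor."""
    return [b - a for a, b in zip(rewards, rewards[1:])]


def simulate(rewards, n_agents=5, alpha=0.1, gamma=0.0,
             temperature=0.5, steps=500, cost=1.0, seed=0):
    """Run independent Q-learners in a public goods game.

    rewards[n] is the reward every agent receives when n agents contribute.
    ASSUMPTION: contributors additionally pay `cost`; the source does not
    state the cost explicitly.
    Returns each agent's final probability of contributing.
    """
    rng = random.Random(seed)
    # q[i] = [Q(defect), Q(contribute)] for agent i, initialised to zero.
    q = [[0.0, 0.0] for _ in range(n_agents)]
    for _ in range(steps):
        # All agents choose simultaneously from their current policies.
        actions = [
            1 if rng.random() < boltzmann_probs(qi, temperature)[1] else 0
            for qi in q
        ]
        n_contrib = sum(actions)
        for qi, action in zip(q, actions):
            payoff = rewards[n_contrib] - cost * action
            # Stateless Q-learning target (ASSUMPTION: bootstrap on own max Q).
            target = payoff + gamma * max(qi)
            qi[action] += alpha * (target - qi[action])
    return [boltzmann_probs(qi, temperature)[1] for qi in q]


if __name__ == "__main__":
    rewards = [0, 0, 0, 2, 4, 6]  # values from the Figure 1 caption, N = 5
    print("jumps:", jumps(rewards))
    for temp in (0.5, 1.0):
        strategies = simulate(rewards, temperature=temp)
        print(f"T={temp}: mean P(contribute) = {sum(strategies) / len(strategies):.3f}")
```

The `jumps` helper makes the language of Figures 2, 4, and 5 concrete: under the assumed unit cost, an agent's incentive to contribute at a given moment is the relevant jump minus the cost. Results from a single seeded run are only indicative; the captions report averages over many runs.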