Environment Complexity and Nash Equilibria in a Sequential Social Dilemma

Mustafa Yasir, Andrew Howes, Vasilios Mavroudis, Chris Hicks

TL;DR

This paper investigates how environment complexity in higher-dimensional sequential social dilemmas affects cooperative outcomes in multi-agent reinforcement learning. By adapting the gridworld Stag Hunt into eight variants of differing complexity and training independent PPO learners in each, the authors show that greater complexity biases convergence toward suboptimal, risk-dominant Nash equilibria, even when higher-reward strategies exist. Through curriculum experiments and empirical game-theoretic analysis, they demonstrate that some Group B environments (Group B comprises FFR, RFR, and RRR) can be mapped to matrix game social dilemma (MGSD) and sequential social dilemma (SSD) structures, and that the suboptimal equilibria are robust to learning dynamics, though more cooperative strategies can be learned under guided training. The work highlights the critical role of environment dynamics in shaping general-sum MARL outcomes and provides a framework for linking complex RL environments to classic game-theoretic analyses.

Abstract

Multi-agent reinforcement learning (MARL) methods, while effective in zero-sum or positive-sum games, often yield suboptimal outcomes in general-sum games where cooperation is essential for achieving globally optimal outcomes. Matrix game social dilemmas, which abstract key aspects of general-sum interactions, such as cooperation, risk, and trust, fail to model the temporal and spatial dynamics characteristic of real-world scenarios. In response, our study extends matrix game social dilemmas into more complex, higher-dimensional MARL environments. We adapt a gridworld implementation of the Stag Hunt dilemma to more closely match the decision-space of a one-shot matrix game while also introducing variable environment complexity. Our findings indicate that as complexity increases, MARL agents trained in these environments converge to suboptimal strategies, consistent with the risk-dominant Nash equilibria strategies found in matrix games. Our work highlights the impact of environment complexity on achieving optimal outcomes in higher-dimensional game-theoretic MARL environments.
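
To make the distinction between payoff-dominant and risk-dominant equilibria concrete, the following is a minimal sketch of a one-shot Stag Hunt matrix game. The payoff values, the `PAYOFFS` table, and all function names are illustrative assumptions and are not taken from the paper. Risk dominance is checked with the standard product-of-deviation-losses criterion for 2x2 games.

```python
from itertools import product

# Hypothetical Stag Hunt payoffs (illustrative only, not from the paper).
# PAYOFFS[(a1, a2)] = (reward to player 1, reward to player 2)
STAG, HARE = 0, 1
ACTIONS = (STAG, HARE)
PAYOFFS = {
    (STAG, STAG): (4, 4),  # both cooperate: highest joint reward
    (STAG, HARE): (0, 3),  # lone stag hunter gets nothing
    (HARE, STAG): (3, 0),
    (HARE, HARE): (2, 2),  # safe but lower reward
}


def pure_nash_equilibria(payoffs):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a1, a2 in product(ACTIONS, repeat=2):
        u1, u2 = payoffs[(a1, a2)]
        p1_best = all(u1 >= payoffs[(d, a2)][0] for d in ACTIONS)
        p2_best = all(u2 >= payoffs[(a1, d)][1] for d in ACTIONS)
        if p1_best and p2_best:
            equilibria.append((a1, a2))
    return equilibria


def deviation_loss_product(payoffs, eq):
    """Product of what each player loses by unilaterally leaving `eq`."""
    a1, a2 = eq
    other1, other2 = 1 - a1, 1 - a2
    loss1 = payoffs[(a1, a2)][0] - payoffs[(other1, a2)][0]
    loss2 = payoffs[(a1, a2)][1] - payoffs[(a1, other2)][1]
    return loss1 * loss2


def risk_dominant(payoffs):
    """Pure equilibrium with the largest product of deviation losses."""
    return max(pure_nash_equilibria(payoffs),
               key=lambda eq: deviation_loss_product(payoffs, eq))


def payoff_dominant(payoffs):
    """Pure equilibrium with the highest total reward."""
    return max(pure_nash_equilibria(payoffs),
               key=lambda eq: sum(payoffs[eq]))


if __name__ == "__main__":
    names = {STAG: "Stag", HARE: "Hare"}
    for label, eq in (("payoff-dominant", payoff_dominant(PAYOFFS)),
                      ("risk-dominant", risk_dominant(PAYOFFS))):
        print(label, tuple(names[a] for a in eq))
```

With these assumed payoffs, (Stag, Stag) is the payoff-dominant equilibrium and (Hare, Hare) is the risk-dominant one: hunting hare is the safer choice when the other player's action is uncertain, which is the kind of suboptimal outcome the abstract describes agents converging to.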

Paper Structure (28 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Training performance of IPPO in eight environments with varying degrees of complexity, detailed in Table \ref{tab:fullenvlist}. Environments are grouped by reward patterns: Group A (left) includes FFF, RFF, FRF, RRF, FRR, and Group B (right) includes FFR, RFR, RRR. Bold lines represent the average performance over five trials, and the shading represents ±1 standard deviation from each point.
  • Figure 2: Training performance of IPPO in a 2-stage curriculum in Group B environments from Section \ref{sec:exp1}. Agents are initially trained in a cooperation-inducing environment (cXXX), before being trained in their target environments.
  • Figure 3: Rendering of the gridworld Stag Hunt environment used in this study.
  • Figure 4: Training performance of PPO across eight environments of varying complexity, with labels detailed in Table \ref{tab:fullenvlist}. Environments are grouped by reward patterns: Group A (left) includes FFF, RFF, FRF, RRF, FRR, and Group B (right) includes FFR, RFR, RRR. Bold lines represent the average performance across five trials, and the shading represents ±1 standard deviation from each point.
  • Figure 5: Training performance of IPPO in Group B environments from Section \ref{sec:exp1}, using the best hyperparameters found from five trials consisting of 1,000 iterations each.