LMFPPO-UBP: Local Mean Field Proximal Policy Optimization with Unbalanced Punishment for Spatial Public Goods Games

Jinshuo Yang; Zhaoqilin Yang; Wenjie Zhou; Xin Wang; Youliang Tian

TL;DR

This paper addresses the problem of fostering cooperation in spatial public goods games by introducing Local Mean-Field Proximal Policy Optimization with Unbalanced Punishment (LMFPPO-UBP). The method feeds a localized mean-field perceptual signal into a PPO-based multi-agent RL framework and couples it with an unbalanced punishment mechanism that penalizes defectors in proportion to the local density of cooperators, without imposing direct costs on cooperators. Empirical results show that LMFPPO-UBP lowers the cooperation threshold and reaches rapid, stable global cooperation across various initial conditions and initialization schemes, outperforming LMFPPO, Fermi updates, and Q-learning. The work combines local social dynamics with policy-gradient learning to design scalable, socially aware decentralized coordination strategies, with potential applications in traffic, energy grids, sensor networks, and robotic swarms.

Abstract

Spatial public goods games are characterized by high-dimensional state spaces and localized externalities, which pose significant challenges for achieving stable and widespread cooperation. Traditional approaches often struggle to capture neighborhood-level strategic interactions and to dynamically align individual incentives with collective welfare. To address these challenges, this paper introduces a novel intelligent decision-making framework called Local Mean-Field Proximal Policy Optimization with Unbalanced Punishment (LMFPPO-UBP). The conventional mean-field concept is reformulated as a socio-statistical sensor embedded directly into the policy gradient space of deep reinforcement learning, allowing agents to adapt their strategies based on mesoscale neighborhood dynamics. Additionally, an unbalanced punishment mechanism is integrated to penalize defectors in proportion to the local density of cooperators, thereby reshaping the payoff structure without imposing direct costs on cooperative agents. Experimental results demonstrate that LMFPPO-UBP promotes rapid and stable global cooperation even under low enhancement factors, consistently outperforming baseline methods such as Q-learning and Fermi update rules. Statistical analyses further validate the framework's effectiveness in lowering the cooperation threshold and achieving better-coordinated outcomes.
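The punishment idea described above can be illustrated with a small sketch. The abstract does not give the payoff equations, so everything concrete here is an assumption made for illustration: a square lattice with periodic boundaries, von Neumann neighborhoods (four neighbors), a unit contribution cost, and a penalty of the form `p * local_coop_density` subtracted from defectors only. The function names are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 1 = cooperator, 0 = defector


def neighbors(i: int, j: int, n: int) -> List[Tuple[int, int]]:
    """Four von Neumann neighbors on an n x n lattice (periodic boundaries, assumed)."""
    return [((i - 1) % n, j), ((i + 1) % n, j), (i, (j - 1) % n), (i, (j + 1) % n)]


def local_coop_density(grid: Grid, i: int, j: int) -> float:
    """Fraction of cooperators among the neighbors of site (i, j).

    Stands in for the local mean-field signal; the paper's exact definition may differ.
    """
    nbrs = neighbors(i, j, len(grid))
    return sum(grid[a][b] for a, b in nbrs) / len(nbrs)


def group_payoff(grid: Grid, ci: int, cj: int, i: int, j: int,
                 r: float, cost: float = 1.0) -> float:
    """Payoff of player (i, j) from the group centered at (ci, cj)."""
    members = [(ci, cj)] + neighbors(ci, cj, len(grid))
    n_coop = sum(grid[a][b] for a, b in members)
    share = r * cost * n_coop / len(members)  # pooled contributions, enhanced by r
    return share - cost * grid[i][j]          # only cooperators pay the cost


def payoff(grid: Grid, i: int, j: int, r: float, p: float = 0.0) -> float:
    """Total payoff over the five groups that (i, j) belongs to, with assumed punishment."""
    centers = [(i, j)] + neighbors(i, j, len(grid))
    total = sum(group_payoff(grid, ci, cj, i, j, r) for ci, cj in centers)
    if grid[i][j] == 0:
        # Unbalanced punishment (assumed form): only defectors are penalized,
        # scaled by the share of cooperators around them.
        total -= p * local_coop_density(grid, i, j)
    return total


if __name__ == "__main__":
    all_coop = [[1] * 3 for _ in range(3)]
    lone_defector = [row[:] for row in all_coop]
    lone_defector[1][1] = 0
    print(payoff(all_coop, 1, 1, r=4.0))               # cooperating
    print(payoff(lone_defector, 1, 1, r=4.0, p=0.0))   # defecting, no punishment
    print(payoff(lone_defector, 1, 1, r=4.0, p=2.0))   # defecting, with punishment
```

In this toy setting with `r = 4`, a lone defector among cooperators earns `4r = 16` without punishment, more than the `5(r - 1) = 15` it would earn by cooperating, while a penalty strength of `p = 2` brings the defector's payoff down to 14. Cooperators' payoffs do not depend on `p`, which is the sense in which the punishment is "unbalanced" in this sketch.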

Paper Structure (18 sections, 15 equations, 10 figures, 1 table, 1 algorithm)

Introduction
Related Works
Model
SPGG
LMFPPO
LMFPPO-UBP
Actor–Critic Network
Experimental Results
Experimental Setup
LMFPPO Hyperparameter Sensitivity Analysis
LMFPPO-UBP Hyperparameter Sensitivity Analysis
Statistical analysis of LMFPPO-UBP
Comparative Analysis of Algorithms
Algorithm performance evaluation under varying enhancement factors r
LMFPPO-UBP with half-and-half initialization
...and 3 more sections

Figures (10)

Figure 1: Architecture of the actor-critic network.
Figure 2: The entropy regularization coefficient $\rho$ significantly affects LMFPPO performance. Empirically, $\rho=0.01$ is optimal. Higher $\rho$ destabilizes gradient trajectories and value function estimation, impairing convergence.
Figure 3: Impact of punishment strength $p$ on LMFPPO-UBP.
Figure 4: Violin plots of final cooperation fractions from 50 trials for three algorithms. LMFPPO-UBP shows a sharp, left-shifted transition, while LMFPPO and PPO exhibit transitions only near $r=5.0$.
Figure 5: Comparison of cooperation fractions from 50 trials for (a) LMFPPO-UBP, (b) LMFPPO, and (c) PPO. Error bars indicate means and standard deviations. LMFPPO-UBP shows a sharp transition at $r\in[4.0,4.3]$, whereas LMFPPO and PPO exhibit delayed, variable transitions only near $r=5.0$.
...and 5 more figures
