Learning with Delayed Payoffs in Population Games using Kullback-Leibler Divergence Regularization

Shinkyu Park; Naomi Ehrich Leonard

TL;DR

The paper tackles learning Nash equilibria in large population games where payoffs are subject to time delays. It introduces Kullback-Leibler Divergence Regularized Learning (KLD-RL), a regularized best-response rule that uses KL divergence to constrain strategy revisions, coupled with a distributed update scheme for the regularization parameter $\theta$. Using passivity-based analysis, the authors prove convergence of the social state to a perturbed Nash equilibrium $\text{PNE}_{\eta,\theta}(\mathcal{F})$ when the regularization weight $\eta$ exceeds a deficit bound, and extend results to two delay models: a time-dependent payoff delay and a smoothing PDM. Simulations on a two-population congestion game and a two-population zero-sum game demonstrate robust convergence to the Nash equilibrium despite delays, with insights on selecting $\eta$ and implementing distributed updates in finite populations. The work provides a principled, scalable approach for delay-robust learning in population games with practical relevance to traffic, smart grids, and security-related multi-agent systems.

Abstract

We study a multi-agent decision problem in large population games. Agents from multiple populations select strategies for repeated interactions with one another. At each stage of these interactions, agents use their decision-making model to revise their strategy selections based on payoffs determined by an underlying game. Their goal is to learn the strategies that correspond to the Nash equilibrium of the game. However, when games are subject to time delays, conventional decision-making models from the population game literature may result in oscillations in the strategy revision process or convergence to an equilibrium other than the Nash. To address this problem, we propose the Kullback-Leibler Divergence Regularized Learning (KLD-RL) model, along with an algorithm that iteratively updates the model's regularization parameter across a network of communicating agents. Using passivity-based convergence analysis techniques, we show that the KLD-RL model achieves convergence to the Nash equilibrium without oscillations, even for a class of population games that are subject to time delays. We demonstrate our main results numerically on a two-population congestion game and a two-population zero-sum game.
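To make the idea of a KL-regularized best response concrete, here is a minimal sketch. It assumes the revision rule maximizes $z^\top p - \eta\, D_{\mathrm{KL}}(z \,\|\, \theta)$ over the probability simplex, where $p$ is the payoff vector observed by a population, and that $\theta$ has strictly positive entries. Under those assumptions the maximizer has the standard closed form $z_i \propto \theta_i \exp(p_i / \eta)$. The function name and the numerical values are illustrative and do not come from the paper.

```python
import numpy as np

def kld_regularized_best_response(payoff, theta, eta):
    """Maximizer of z @ payoff - eta * KL(z || theta) over the simplex.

    Assumes theta lies in the interior of the simplex (all entries > 0)
    and eta > 0. Closed form: z_i proportional to theta_i * exp(payoff_i / eta).
    """
    payoff = np.asarray(payoff, dtype=float)
    theta = np.asarray(theta, dtype=float)
    logits = np.log(theta) + payoff / eta
    logits -= logits.max()          # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical payoff vector and regularization parameter for three strategies.
p = np.array([1.0, 0.5, -0.2])
theta = np.array([0.2, 0.3, 0.5])
z = kld_regularized_best_response(p, theta, eta=0.7)
```

With a uniform $\theta$ the rule reduces to the standard logit (softmax) choice, and as $\eta$ grows the output is pulled toward $\theta$. That is the sense in which the regularization constrains how far a revision can move away from the current parameter.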

Paper Structure (30 sections, 69 equations, 9 figures, 1 table, 2 algorithms)

Introduction
Problem Description
Population Games and Time Delays in Payoff Mechanisms
Population games
Time delays in payoff mechanisms
Payoff Function with a Time-Dependent Delay
Smoothing Payoff Dynamics Model
Strategy Revision and Evolutionary Dynamics Model
Literature Review
Learning Nash Equilibrium with Delayed Payoffs
Preliminary Convergence Analysis
Iterative KLD Regularization and Convergence Guarantee
Distributed Parameter Update
Simulations with Numerical Examples
Convergence and Performance Improvements
...and 15 more sections

Figures (9)

Figure 1: Two-Population Congestion Game. Agents in each population $k$ traverse from origin $O_k$ to destination $D_k$ using one of the following routes: $O_1 \to A \to D_1$ (Route 1), $O_1 \to A \to B \to D_1$ (Route 2), and $O_1 \to B \to D_1$ (Route 3) for population 1; $O_2 \to A \to D_2$ (Route 1), $O_2 \to B \to A \to D_2$ (Route 2), and $O_2 \to B \to D_2$ (Route 3) for population 2. We assume that when the same number of agents use the links, the diagonal links ($O_1 \to B, O_2 \to A, A \to D_1, B \to D_2$) are $50 \%$ more congested than the horizontal links, e.g., because the roads represented by the diagonal links are narrower; whereas the vertical link $A \leftrightarrow B$ is $50 \%$ less congested than the horizontal links, e.g., because the road associated with the vertical link is wider. The different weights on the links reflect this assumption.
Figure 2: Two-Population Zero-Sum Game. Agents in the defender population (population 1) select defending strategies $(DS_1, DS_2, DS_3)$ to play against those in the attacker population (population 2) who adopt attacking strategies $(AS_1, AS_2, AS_3)$. The positive (negative) weight on the blue (red dotted) arrow between $DS_i$ and $AS_j$ denotes the reward (loss) associated with population 1 when the defenders and attackers adopt $DS_i$ and $AS_j$, respectively. The payoff $\mathcal{F}^1_i(x^1, x^2)$ associated with $DS_i$ of population 1 is the sum of the rewards and losses when $x^2$ is the state of population 2; whereas the payoff $\mathcal{F}^2_j(x^1, x^2)$ associated with $AS_j$ of population 2 is the negative sum of the rewards and losses when $x^1$ is the state of population 1.
Figure 3: State trajectories of population $1$ under the logit protocol \ref{eq:StandardLogitProtocol} with $\eta=0.1, 4.5$ in the congestion game \ref{eq:congestion_game}. The payoff vector is determined by \ref{eq:population_game_time_delay} subject to a fixed unit time delay ($d(t)=1, ~ \forall t \geq 0$). The red circle in both (a) and (b) marks the Nash equilibrium and the red X mark in (b) denotes the unique limit point of all the trajectories.
Figure 4: State trajectories of population $1$ under the logit protocol \ref{eq:StandardLogitProtocol} with $\eta=0.1, 0.6$ in the zero-sum game \ref{eq:rps_game}. The payoff vector is determined by \ref{eq:smoothing_pdm} with $\lambda=1$. The red circle in both (a) and (b) marks the Nash equilibrium and the red X mark in (b) denotes the unique limit point of all the trajectories.
Figure 5: A feedback interconnection illustrating the payoff dynamics model and the KLD-RL model along with an algorithm for updating the regularization parameter $\theta$.
...and 4 more figures

Theorems & Definitions (11)

Definition 1: Nash Equilibrium
Example 1
Example 2
Definition 2: Contractive Population Game \cite{SANDHOLM2015703}
Remark 1
Remark 2
Definition 3: Weak $\delta$-Antipassivity with Deficit $\bar{\nu}$ \cite{9029756}
Definition 4: $\delta$-Passivity with Surplus $\bar{\eta}$ \cite{9029756}
Proof
Definition 5
...and 1 more
