Global Behavior of Learning Dynamics in Zero-Sum Games with Memory Asymmetry

Yuma Fujimoto; Kaito Ariu; Kenshi Abe

TL;DR

This work studies the global learning dynamics of two-player zero-sum games with asymmetric memory, where X remembers Y's previous action and Y has no memory. It introduces two indicators, the extended conditional KL divergence $D(X,y)$ and a family of Lyapunov functions $H(X;\delta)$, to capture global behavior beyond local stability. The authors prove that $D(X,y)$ decreases when X exploits Y (indicating convergence to the Nash equilibrium) and that $H$ increases monotonically when Y's strategy is out of equilibrium (quantifying how X becomes more exploitative over time), and they validate global convergence in the matching pennies game and related coupled variants. These results provide a quantitative framework for understanding and predicting global learning outcomes in games with memory, with implications for designing and analyzing learning dynamics in multi-agent systems.

Abstract

This study examines the global behavior of dynamics in learning in games between two players, X and Y. We consider the simplest situation for memory asymmetry between two players: X memorizes the other Y's previous action and uses reactive strategies, while Y has no memory. Although this memory complicates their learning dynamics, we characterize the global behavior of such complex dynamics by discovering and analyzing two novel quantities. One is an extended Kullback-Leibler divergence from the Nash equilibrium, a well-known conserved quantity from previous studies. The other is a family of Lyapunov functions of X's reactive strategy. One of the global behaviors we capture is that if X exploits Y, then their strategies converge to the Nash equilibrium. Another is that if Y's strategy is out of equilibrium, then X becomes more exploitative with time. Consequently, we suggest global convergence to the Nash equilibrium from both aspects of theory and experiment. This study provides a novel characterization of the global behavior in learning in games through a couple of indicators.

Paper Structure (29 sections, 5 theorems, 16 equations, 4 figures)

Introduction
Preliminary
Settings
Stationary State and Expected Payoff
Nash Equilibrium
Learning Algorithm: Replicator Dynamics
Theory on Learning Dynamics
Polynomial Expressions of Learning
Positive Definiteness for Zero-Sum Vectors
Extended Kullback-Leibler Divergence
Family of Lyapunov Functions
Global Behavior by Two Quantities
$D$ explains increasing/decreasing of distance:
$H$ explains the monotonic increase of exploitability:
Remark:
...and 14 more sections

Key Result

Theorem 1

If $\boldsymbol{X}^{\rm T}\boldsymbol{U}$ is positive definite for zero-sum vectors, then $D^{\dagger}(\boldsymbol{X};{\rm d}\boldsymbol{y}):=\dot{D}(\boldsymbol{X},\boldsymbol{y})<0$ for all ${\rm d}\boldsymbol{y}:=\boldsymbol{y}-\boldsymbol{y}^{*}\neq \boldsymbol{0}$.
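The hypothesis of Theorem 1 can be checked numerically in the two-action case. The sketch below is illustrative only and rests on assumptions not stated in this summary: that "positive definite for zero-sum vectors" means $\boldsymbol{\delta}^{\rm T}\boldsymbol{X}^{\rm T}\boldsymbol{U}\boldsymbol{\delta}>0$ for every nonzero $\boldsymbol{\delta}$ whose entries sum to zero, and that $\boldsymbol{X}$ is stored with entries `X[i][j]` $=x_{i|j}$. The function names are hypothetical. With two actions, every zero-sum vector is a multiple of $(1,-1)$, so testing that single direction suffices.

```python
def quadratic_form(X, U, delta):
    """Return delta^T (X^T U) delta.

    Assumed layout (not confirmed by the source): X[i][j] = x_{i|j},
    the probability that X plays i after Y played j; U[i][k] = u_{ik}.
    """
    n_rows = len(U)
    n = len(delta)
    # M = X^T U, so M[j][k] = sum_i x_{i|j} * u_{ik}
    M = [[sum(X[i][j] * U[i][k] for i in range(n_rows)) for k in range(n)]
         for j in range(n)]
    return sum(delta[j] * M[j][k] * delta[k] for j in range(n) for k in range(n))


def is_positive_definite_two_actions(X, U):
    """Two-action check: all zero-sum vectors are multiples of (1, -1)."""
    return quadratic_form(X, U, [1.0, -1.0]) > 0.0


# Matching pennies payoff matrix for X.
U_MP = [[1.0, -1.0], [-1.0, 1.0]]

# X imitates Y's previous action more often than not: x_{1|1} > x_{1|2}.
X_imitate = [[0.8, 0.3], [0.2, 0.7]]
# X tends to play the opposite of Y's previous action: x_{1|1} < x_{1|2}.
X_oppose = [[0.3, 0.8], [0.7, 0.2]]

print(is_positive_definite_two_actions(X_imitate, U_MP))
print(is_positive_definite_two_actions(X_oppose, U_MP))
```

Under these assumptions the quadratic form for matching pennies reduces to $4(x_{1|1}-x_{1|2})$, so the condition holds exactly when X is more likely to play action 1 after Y played action 1 than after Y played action 2.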

Figures (4)

Figure 1: (A) Illustration of the global behavior of the conditional divergence, $D(\boldsymbol{X},\boldsymbol{y})$. Three trajectories (red, black, and blue) are plotted with the Nash equilibrium (the black star marker). The horizontal and vertical axes show X's strategy ($x_{1}^{{\rm st}}$) and Y's strategy ($y_{1}$) in the matching pennies game (formulated in Fig. 2). This divergence decreases (red: $\dot{D}<0$), cycles (black: $\dot{D}=0$), or increases (blue: $\dot{D}>0$) with time. These three lines are plotted for the different initial strategies, i.e., $\boldsymbol{X}$ and $\boldsymbol{y}$. (B) Illustration of the global behavior of the family of Lyapunov functions, $H(\boldsymbol{X};\boldsymbol{\delta})$. The colored line shows a trajectory (from purple to red) of Lyapunov functions $H_1$, $H_2$, and $H_3$, each of which is $H(\boldsymbol{X};\boldsymbol{\delta})$ for some specific $\boldsymbol{\delta}$. The gray broken lines are the projections of the black solid line to $H_1$-$H_2$, $H_2$-$H_3$, and $H_3$-$H_1$ planes. All of $H_1$, $H_2$, and $H_3$ monotonically increase with time.
Figure 2: Illustration of games between reactive and zero-memory strategies. The area surrounded by the magenta dotted line shows the normal-form game. In each round, X chooses action $i=1$ or $2$ in the row, following its strategy, i.e., the probability distribution $\boldsymbol{x}=(x_1,x_2)$. On the other hand, Y chooses action $j=1$ or $2$ in the column, following its strategy, i.e., the probability distribution $\boldsymbol{y}=(y_1,y_2)$. Depending on their choices $i$ and $j$, X receives a payoff $u_{ij}$, given in matrix form by $\boldsymbol{U}=(u_{ij})_{i,j}=((u_{11},u_{12}), (u_{21},u_{22}))$. Furthermore, in zero-sum games, Y receives $-u_{ij}$. In the matching pennies game in particular, action $1$ ($2$) corresponds to choosing "head" ("tail") of a coin. When their choices match ($i=j$), X wins, i.e., $u_{11}=u_{22}=1$ (the orange blocks). Otherwise, when their choices mismatch ($i\neq j$), Y wins, i.e., $u_{12}=u_{21}=-1$ (the blue blocks). The area outside of the magenta dotted line shows the difference due to the effect of memory. The gray box shows that X memorizes Y's previous action, represented as $j=1$ or $2$. Thus, X uses a reactive strategy and can choose its action with the conditional probabilities $x_{1|j}$ and $x_{2|j}$ given Y's previous action.
Figure 3: (A) Trajectories of $q_1$ and $q_2$. The rainbow contour plot indicates the value of $q_1-q_2$. Along every trajectory, $q_1-q_2$ increases monotonically with time, and the final states lie in the region $q_1>q_2$. (B) Trajectories of the learning dynamics. The black broken line corresponds to the region of Nash equilibria, $\boldsymbol{x}^{{\rm st}}=\boldsymbol{y}=(1/2,1/2)$. Each colored line shows a trajectory of the learning dynamics. The circle markers show the initial states. Along the blue segments, the trajectories diverge from the Nash equilibria ($D(\boldsymbol{X},\boldsymbol{y})$ increases with time). The trajectories then stop diverging and begin to converge to the Nash equilibria ($D(\boldsymbol{X},\boldsymbol{y})$ decreases) along the red segments. The star markers are the final states, each of which is one of the Nash equilibria.
Figure 4: Global convergence in the coupled matching pennies games, where the second, third, fourth, and first actions beat the other player's first, second, third, and fourth actions, respectively. The winner receives a payoff of $2$ (the orange blocks in the matrices where X wins), while the loser pays $2$ (the blue blocks). We introduce three variants for the other blocks in the payoff matrix. (A) The case of interior equilibrium. We set each of the other blocks to a random number in $[-1,1]$ (the gray blocks). Then, Y's strategy converges to the unique Nash equilibrium (the red star marker) independent of its initial state (the blue circle markers). (B) The case of continuous equilibrium. We set each of the other blocks to $0$, so that the payoff matrix degenerates. Y's strategy converges to one of the Nash equilibria (the line consisting of the red star markers) depending on its initial state. (C) The case of boundary equilibrium. Only the block for the interaction between an action is set to $-1$, and the others are $0$. In this case, X's strategy converges to the unique Nash equilibrium (the orange star markers) independent of its initial state (the green circle markers). Y's strategies, by contrast, do not converge.
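The game setup described in the Figure 2 caption can be made concrete with a small sketch. It assumes (this is not spelled out in the summary) that Y's actions are drawn independently each round from $\boldsymbol{y}$, so that X's stationary action distribution is $x_i^{\rm st}=\sum_j x_{i|j}y_j$ and X's expected payoff is $\sum_{i,j}x_i^{\rm st}u_{ij}y_j$. The function names and the storage layout `X[i][j]` $=x_{i|j}$ are hypothetical choices made for illustration.

```python
def stationary_action_distribution(X, y):
    """X's marginal action probabilities x^st when Y plays y every round.

    Assumed layout: X[i][j] = x_{i|j}, the probability that X plays i
    given that Y's previous action was j.
    """
    return [sum(X[i][j] * y[j] for j in range(len(y))) for i in range(len(X))]


def expected_payoff(X, U, y):
    """X's expected per-round payoff; Y receives the negative in a zero-sum game.

    Assumes Y's current action is independent of its previous one, so X's
    current action (a reaction to the previous round) is independent of
    Y's current action.
    """
    x_st = stationary_action_distribution(X, y)
    return sum(x_st[i] * U[i][j] * y[j]
               for i in range(len(x_st)) for j in range(len(y)))


# Matching pennies: X wins on a match, Y wins on a mismatch.
U_MP = [[1.0, -1.0], [-1.0, 1.0]]
# A reactive strategy that tends to repeat Y's previous action.
X_reactive = [[0.8, 0.3], [0.2, 0.7]]

# Y at the Nash equilibrium (1/2, 1/2) and Y biased toward action 1.
for y in ([0.5, 0.5], [0.7, 0.3]):
    print(y, stationary_action_distribution(X_reactive, y),
          expected_payoff(X_reactive, U_MP, y))
```

In this sketch X's payoff is zero (up to rounding) when Y plays $(1/2,1/2)$ and positive when Y is biased toward the action X tends to copy, which illustrates what "X exploits Y" means for an out-of-equilibrium $\boldsymbol{y}$.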

Theorems & Definitions (7)

Definition 1: Positive definiteness for zero-sum vectors
Theorem 1: Monotonic decrease of $D$ for positive definite $\boldsymbol{X}^{\rm T}\boldsymbol{U}$
Theorem 2: Monotonic increase of $H$
Corollary 1: Global convergence in matching pennies
Definition 2: Positive definiteness
Theorem 3: Equivalence to positive definiteness
Theorem 4: Connection with the classical total divergence
