Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games

Yuma Fujimoto; Kaito Ariu; Kenshi Abe

TL;DR

The paper investigates learning in two-player zero-sum games under memory asymmetry. It introduces a discretized multi-memory gradient ascent (MMGA) algorithm together with a Markov-transition formulation to analyze the stationary payoffs $u^{st}$ and $v^{st}$. It shows that the original Nash equilibrium (NE) of the memoryless game splits into unstable and stable fixed points, with heteroclinic orbits connecting them. The longer-memory player learns to exploit its opponent, which makes the opponent's utility function strictly concave and drives convergence to the stable fixed points, i.e., the with-memory NE. The authors prove local convergence under a key concavity condition and validate the theory with simulations across various memory lengths and action numbers, observing robust last-iterate convergence to the original NE. These results reveal a novel convergence mechanism powered by memory asymmetry, with potential implications for computing equilibria in learning dynamics and broader strategic settings.

Abstract

Learning in games considers how multiple agents maximize their own rewards through repeated games. Memory, the ability of an agent to change its action depending on the history of actions in previous games, is often introduced into learning to explore more sophisticated strategies and to model the decision-making of real agents such as humans. However, such games with memory are hard to analyze because they exhibit complex phenomena such as chaotic dynamics or divergence from Nash equilibrium. In particular, how asymmetry in memory capacities between agents affects learning in games is still unclear. In response, this study formulates a gradient ascent algorithm for games with asymmetric memory capacities. To obtain theoretical insights into learning dynamics, we first consider a simple case of zero-sum games. We observe complex behavior, where learning dynamics draw a heteroclinic connection from unstable fixed points to stable ones. Despite this complexity, we analyze learning dynamics and prove local convergence to these stable fixed points, i.e., the Nash equilibria. We identify the mechanism driving this convergence: an agent with a longer memory learns to exploit the other, which in turn endows the other's utility function with strict concavity. We further numerically observe such convergence for various initial strategies, action numbers, and memory lengths. This study reveals a novel phenomenon due to memory asymmetry, advancing the fundamental understanding of learning in games and offering new insights into computing equilibria.

Paper Structure (21 sections, 5 theorems, 21 equations, 7 figures, 1 algorithm)

Introduction
Preliminary
Two-Player Normal-Form Games
Games with Memory Asymmetry
Formulation as Markov Transition Processes
Algorithm
Theoretical Results
Assumptions
Analysis of Nash Equilibrium
Analysis of Learning Dynamics
Visualization of Heteroclinic Dynamics
Experimental Results
Dynamics for Various Memory Lengths
Dynamics for Various Action Numbers
Experimental Results with Many Samples
...and 6 more sections

Key Result

Theorem 1

Under Definition 2 (one-memory and zero-memory strategies), the stationary state can be described as $\boldsymbol{p}^{{\rm st}}(\boldsymbol{x},y)=(x^{{\rm st}},\tilde{x}^{{\rm st}})\otimes(y,\tilde{y}):=(x^{{\rm st}}y,x^{{\rm st}}\tilde{y},\tilde{x}^{{\rm st}}y,\tilde{x}^{{\rm st}}\tilde{y})$. Here, $x^{{\rm st}}$ is called X's "ma
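As a hedged numerical check of the product form stated in Theorem 1, the sketch below builds the Markov transition matrix for a one-memory player X against a zero-memory player Y and compares its stationary distribution with $(x^{{\rm st}},\tilde{x}^{{\rm st}})\otimes(y,\tilde{y})$. The state ordering (a1b1, a1b2, a2b1, a2b2), the indexing of $\boldsymbol{x}$ by the previous state, the function names, and the example numbers are all assumptions made for illustration. The closed form used for $x^{{\rm st}}$ is derived here from the self-consistency condition $x^{{\rm st}}=\sum_s p^{{\rm st}}_s x_s$ and is not quoted from the paper.

```python
import numpy as np

def transition_matrix(x, y):
    """Column-stochastic 4x4 matrix M[s', s].

    Assumed state order: (a1b1, a1b2, a2b1, a2b2).
    x[s] is X's probability of playing a1 after state s (one-memory);
    y is Y's fixed probability of playing b1 (zero-memory).
    """
    x = np.asarray(x, dtype=float)
    y_vec = np.array([y, 1.0 - y])
    M = np.empty((4, 4))
    for s in range(4):
        x_vec = np.array([x[s], 1.0 - x[s]])
        # Joint distribution of the next pair of actions given state s.
        M[:, s] = np.kron(x_vec, y_vec)
    return M

def stationary_state(M):
    """Eigenvector of M for the eigenvalue closest to 1, normalised to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(M)
    k = np.argmin(np.abs(eigvals - 1.0))
    p = np.real(eigvecs[:, k])
    return p / p.sum()

def marginal_x_st(x, y):
    """X's stationary probability of a1, assuming the product form holds."""
    x1, x2, x3, x4 = x
    into_a1 = y * x3 + (1.0 - y) * x4                   # a2 -> a1 flow rate
    out_of_a1 = y * (1.0 - x1) + (1.0 - y) * (1.0 - x2)  # a1 -> a2 flow rate
    return into_a1 / (into_a1 + out_of_a1)

# Hypothetical interior strategies, chosen only for illustration.
x = (0.9, 0.2, 0.6, 0.3)
y = 0.4

M = transition_matrix(x, y)
p_st = stationary_state(M)
x_st = marginal_x_st(x, y)
p_product = np.kron([x_st, 1.0 - x_st], [y, 1.0 - y])

print("stationary state :", p_st)
print("product form     :", p_product)
```

For interior strategies the chain has a unique stationary distribution, so the two printed vectors should agree up to floating-point error. Because Y ignores the history, its action is independent of the state, which is why the joint distribution factorises.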

Figures (7)

Figure 1: Learning dynamics illustrated for three memory-configuration scenarios involving two agents. Learning dynamics cycle around the Nash equilibrium when the agents have no memory (left panel). Learning dynamics diverge from the Nash equilibrium when the agents have the same memory capacity (center). Learning dynamics draw heteroclinic orbits and eventually converge to the Nash equilibrium when the agents have different memory lengths (right). In all the panels, the horizontal and vertical axes indicate the probabilities of the agents choosing "head" in matching-pennies games (see Fig. 2). In the center and right panels, the color gradient indicates the passage of time (blue represents older data, and red represents newer data).
Figure 1: Experimental results with many samples. In each panel, the solid line shows the mean value of KL divergence for $50$ samples. The lightly colored area shows the standard deviation estimated from the $50$ samples. When $m=2$, we consider a matching-pennies game, where $x_a^{\rm o}=y_b^{\rm o}=1/2$ for all $a$ and $b$. When $m=3$, we consider a rock-paper-scissors game, where $x_a^{\rm o}=y_b^{\rm o}=1/3$ for all $a$ and $b$. When $m=4$, we consider an extended-rock-paper-scissors game, where $x_a^{\rm o}=y_b^{\rm o}=1/4$ for all $a$ and $b$.
Figure 2: A: Schematics of with-memory games. The area surrounded by the magenta dotted line shows a classic normal-form game, where player X (green) chooses action $a_1$ or $a_2$ with probabilities $x_{a_1}$ and $x_{a_2}$ in the rows of the matrix, while player Y (orange) chooses action $b_1$ or $b_2$ with probabilities $y_{b_1}$ and $y_{b_2}$ in the columns. In the matching-pennies game in particular, $a_1=b_1=$ "head" and $a_2=b_2=$ "tail". Matching actions of X and Y lead to X's win (green panel), while mismatching actions lead to Y's win (orange panel). In with-memory games, $x_{a_i}$ and $y_{b_i}$ are given by $x_{a_i|s}$ and $y_{b_i|s_{n_{\rm Y}}}$. Here, $s$ is the string of their actions played in the previous $n_{\rm X}$ rounds. In addition, because Y has a shorter memory than X, $s_{n_{\rm Y}}$ is defined as a substring of $s$. B: Schematics of Markov transition in with-memory games. In the transition from $s$ to $s'$, $s_{n_{\rm X}-1}$ (blue), i.e., the substring formed by the last $2(n_{\rm X}-1)$ actions of $s$, continues to exist in $s'$. The actions chosen by X and Y, $a_1$ (green) and $b_2$ (orange), are appended to this substring $s_{n_{\rm X}-1}$. These choices occur with probability $M_{s's}$ and give X and Y the payoffs $u_{a_1b_2}$ and $v_{a_1b_2}$, respectively.
Figure 2: With X's strategy fixed after learning, Y's utility function is plotted as a function of Y's strategy, i.e., ${\bf y}=\{y_1,y_2,y_3\}$. In all the panels, red (blue) indicates that Y's utility is large (small). The lower-left point of the simplex indicates that Y purely takes the $b_1=$Rock action (i.e., $y_1=1$). The lower-right point indicates the $b_2=$Paper action (i.e., $y_2=1$). The upper-middle point indicates the $b_3=$Scissors action (i.e., $y_3=1$). In the ordinary (left) and weighted (center) rock-paper-scissors games, Y's utility function is strictly concave and takes its maximum value at the Nash equilibrium. In the right panel, X uses a $0$-memory strategy, and Y's utility function is linear.
Figure 3: An example of a heteroclinic orbit. A: Time series of $\boldsymbol{x}=(x_1,x_2,x_3,x_4)$, $x^{{\rm st}}(\boldsymbol{x},y)$, and $y$. B: Illustration of trajectories of learning dynamics. The trajectory is plotted in a three-dimensional space consisting of $x^{{\rm st}}(\boldsymbol{x},y)$, $y$, and $-\tilde{x}_1x_4+\tilde{x}_2x_3$. Only the solid line and the cross mark (i.e., $-\tilde{x}_1x_4+\tilde{x}_2x_3\ge 0$) are Nash equilibria, whereas the solid and dashed gray lines together indicate the states that correspond to the original Nash equilibrium, i.e., $(x^{{\rm st}}(\boldsymbol{x},y),y)=(x^{\rm o},y^{\rm o})$. Note that, in practice, this line is a three-dimensional manifold in the five-dimensional space of $\boldsymbol{x}$ and $y$. On the solid line the stability condition $-\tilde{x}_1x_4+\tilde{x}_2x_3> 0$ holds, while on the dashed line it fails ($-\tilde{x}_1x_4+\tilde{x}_2x_3<0$). The trajectory, colored from blue (time $0$) to red (time $180$), corresponds to the time series in panel A.
...and 2 more figures
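
The cycling described for the no-memory panel of Figure 1 can be illustrated with a small simulation. The sketch below is not the paper's MMGA algorithm: it assumes continuous-time replicator dynamics as a stand-in for zero-memory learning, matching-pennies payoffs of $\pm 1$, forward-Euler integration, and hypothetical function names and initial values. Under these assumptions the summed KL divergence from the Nash equilibrium $(1/2, 1/2)$ is conserved in continuous time, so the trajectory orbits the equilibrium instead of approaching it.

```python
import numpy as np

def kl_from_uniform(q):
    """KL divergence from (1/2, 1/2) to the strategy (q, 1 - q)."""
    return -np.log(2.0) - 0.5 * (np.log(q) + np.log(1.0 - q))

def simulate(x0, y0, dt=1e-3, steps=20_000):
    """Forward-Euler replicator dynamics in matching pennies.

    x, y: probabilities that X and Y play "head".
    Assumed payoffs: X gets +1 on a match and -1 otherwise; Y gets the negative.
    Returns the summed KL divergence from the Nash equilibrium at every step.
    """
    x, y = x0, y0
    total_kl = np.empty(steps + 1)
    total_kl[0] = kl_from_uniform(x) + kl_from_uniform(y)
    for t in range(steps):
        dx = x * (1.0 - x) * 2.0 * (2.0 * y - 1.0)    # X gains by matching Y
        dy = -y * (1.0 - y) * 2.0 * (2.0 * x - 1.0)   # Y gains by mismatching X
        x, y = x + dt * dx, y + dt * dy
        total_kl[t + 1] = kl_from_uniform(x) + kl_from_uniform(y)
    return total_kl

total_kl = simulate(x0=0.7, y0=0.5)
print("summed KL at start:", total_kl[0])
print("summed KL at end  :", total_kl[-1])
```

With a small step size the two printed values should stay close to each other; any slow drift comes from the Euler discretisation, not from the dynamics themselves. This contrasts with the asymmetric-memory case described in the right panel of Figure 1, where the dynamics eventually converge.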

Theorems & Definitions (7)

Definition 1: Original Nash equilibrium in two-action zero-sum normal-form game
Definition 2: One-memory and zero-memory strategies and vector notation
Theorem 1: Stationary state
Theorem 2: With-memory Nash equilibrium
Theorem 3: Fixed points of learning dynamics
Theorem 4: Local convergence to Nash equilibrium
Corollary 1: Divergence from fixed points
