
Deterministic Model of Incremental Multi-Agent Boltzmann Q-Learning: Transient Cooperation, Metastability, and Oscillations

David Goll, Jobst Heitzig, Wolfram Barfuss

TL;DR

This work investigates the dynamics of independent Q-learning with Boltzmann exploration in a two-agent, single-state Prisoner's Dilemma. It shows that prior deterministic approximations (FAQL/BQL) fail to capture the true learning dynamics, which exhibit metastable cooperation and long-lived oscillations due to a moving-target environment. The authors develop a discrete-time, choice-probability-aware model that preserves per-action update frequencies in the full $4$-D Q-space and reveal a bifurcation-driven transition from a stable fixed point to oscillatory dynamics as the discount factor $\gamma$ increases, via a Neimark–Sacker bifurcation. The results highlight fundamental limitations of reduced policy-space analyses for MARL and emphasize the need to account for update frequencies and non-stationarity when interpreting learning dynamics, with implications for designing robust MARL algorithms and extending the analysis to more complex environments.
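
As a minimal sketch of the algorithm under study, the following simulates two independent Q-learners with Boltzmann exploration on a single-state Prisoner's Dilemma. The payoff values, the function names, and the step count are illustrative assumptions rather than details taken from the paper; the defaults $T=1$, $\alpha=0.01$, $\gamma=0.8$ mirror the settings quoted in the figure captions.

```python
import numpy as np

# Hypothetical Prisoner's Dilemma payoffs (an assumption, not from the source).
# R[own_action, other_action], with action 0 = cooperate (C), 1 = defect (D).
R = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def boltzmann(q, T):
    """Boltzmann (softmax) policy over a vector of Q-values at temperature T."""
    z = (q - q.max()) / T          # shift by the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_iql(q_init, alpha=0.01, gamma=0.8, T=1.0, steps=10_000, seed=0):
    """Simulate two independent learners; return final Q and pi_C over time."""
    rng = np.random.default_rng(seed)
    Q = np.array(q_init, dtype=float)      # shape (2 agents, 2 actions)
    coop = np.empty((steps, 2))            # cooperation probabilities per step
    for t in range(steps):
        pi = np.array([boltzmann(Q[i], T) for i in range(2)])
        coop[t] = pi[:, 0]
        a = [rng.choice(2, p=pi[i]) for i in range(2)]
        for i in range(2):
            r = R[a[i], a[1 - i]]
            # Incremental update: only the action actually chosen is changed,
            # so rarely chosen actions are updated rarely.
            Q[i, a[i]] += alpha * (r + gamma * Q[i].max() - Q[i, a[i]])
    return Q, coop

# Start near the joint policy (0.5, 0.48) mentioned in the Figure 1 caption.
Q_final, coop = run_iql([[0.0, 0.0], [-0.04, 0.04]], steps=2_000)
```

Because each run is a single realisation of a stochastic process, different seeds give different timings and trajectories.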

Abstract

Multi-Agent Reinforcement Learning involves agents that learn together in a shared environment, leading to emergent dynamics sensitive to initial conditions and parameter variations. A Dynamical Systems approach, which studies the evolution of multi-component systems over time, has uncovered some of the underlying dynamics by constructing deterministic approximation models of stochastic algorithms. In this work, we demonstrate that even in the simplest case of independent Q-learning with a Boltzmann exploration policy, significant discrepancies arise between the actual algorithm and previous approximations. We elaborate why these models actually approximate interesting variants rather than the original incremental algorithm. To explain the discrepancies, we introduce a new discrete-time approximation model that explicitly accounts for agents' update frequencies within the learning process and show that its dynamics fundamentally differ from the simplified dynamics of prior models. We illustrate the usefulness of our approach by applying it to the question of spontaneous cooperation in social dilemmas, specifically the Prisoner's Dilemma as the simplest case study. We identify conditions under which the learning behaviour appears as long-term stable cooperation from an external perspective. However, our model shows that this behaviour is merely a metastable transient phase and not a true equilibrium, making it exploitable. We further exemplify how specific parameter settings can significantly exacerbate the moving target problem in independent learning. Through a systematic analysis of our model, we show that increasing the discount factor induces oscillations, preventing convergence to a joint policy. These oscillations arise from a supercritical Neimark-Sacker bifurcation, which transforms the unique stable fixed point into an unstable focus surrounded by a stable limit cycle.
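
One way to read "accounting for update frequencies" is that the expected change of each Q-value is the expected temporal-difference error scaled by the probability of choosing that action. The sketch below implements that reading as a discrete-time map on the four Q-values. It is an assumed form, not the paper's exact model equation, and the payoff matrix and all function names are hypothetical. Setting `frequency_weighted=False` drops the probability factor, which illustrates the kind of simplification the abstract attributes to prior models.

```python
import numpy as np

# Hypothetical Prisoner's Dilemma payoffs (an assumption, not from the source).
# R[own_action, other_action], with action 0 = cooperate (C), 1 = defect (D).
R = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def policies(Q, T):
    """Row-wise Boltzmann policies for Q of shape (2 agents, 2 actions)."""
    z = (Q - Q.max(axis=1, keepdims=True)) / T
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def expected_step(Q, alpha=0.01, gamma=0.8, T=1.0, frequency_weighted=True):
    """One step of the assumed deterministic map in 4-D Q-space."""
    pi = policies(Q, T)
    Q_new = Q.copy()
    for i in range(2):
        exp_r = R @ pi[1 - i]                    # expected reward per own action
        td = exp_r + gamma * Q[i].max() - Q[i]   # expected TD error per action
        w = pi[i] if frequency_weighted else 1.0 # per-action update frequency
        Q_new[i] = Q[i] + alpha * w * td
    return Q_new

def symmetric_fixed_point(T=1.0, iters=200):
    """Iterate p -> pi_C(expected-reward difference) to a symmetric fixed point."""
    p = 0.5
    for _ in range(iters):
        exp_r = R @ np.array([p, 1.0 - p])
        p = 1.0 / (1.0 + np.exp(-(exp_r[0] - exp_r[1]) / T))
    return p

Q = np.zeros((2, 2))
for _ in range(1_000):
    Q = expected_step(Q)
```

Under this assumed form, a fixed point requires the expected TD error of every action to vanish, so $Q_C - Q_D$ equals the difference in expected rewards and does not involve $\gamma$. With the assumed payoffs and $T=1$, `symmetric_fixed_point()` gives a cooperation probability of roughly $0.227$.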

Paper Structure (18 sections, 28 equations, 5 figures, 1 algorithm)

Figures (5)

  • Figure 1: Comparison between a single run of independent Q-learning on the Prisoner's Dilemma (top panels: A, B) and our deterministic approximation model (bottom panels: C, D), defined by \ref{eq:QLmodelNEW}, for $T=1$, $\alpha = 0.01$, $\gamma = 0.8$, $Q_{base} = 0$. Note that the depicted runs in A and B represent single instances of a stochastic process. Timings and trajectories vary across different runs. The first two subplots in each panel show the evolution of the $Q$-values ($Q^1_C, Q^1_D, Q^2_C, Q^2_D$), while the third subplot illustrates the resulting probabilities of cooperation ($\pi^1_C, \pi^2_C$). The dotted policy trajectories in C and D represent previous approximation methods: FAQL, defined by \ref{eq:FAQL_model}, and BQL, defined by \ref{eq:BQL_model}. The left panels (A, C) depict an initial joint policy $(\pi^1_C, \pi^2_C) = (0.5, 0.48)$, corresponding to $Q$-values $(0, 0, -0.04, 0.04)$ via \ref{eq:Q-value_initialisation}. The right panels (B, D) show an initial joint policy $(\pi^1_C, \pi^2_C) = (0.9, 0.7)$, corresponding to $Q$-values $(1.1, -1.1, 0.4, -0.4)$ via \ref{eq:Q-value_initialisation}.
  • Figure 2: Comparison between averaged policy trajectories of independent Q-learning on the Prisoner's Dilemma (I) and previous deterministic models (II) for $T=1$ and $\alpha=0.01$. I: Top panels (A, B): $Q_{base} = \min(\mathbf{R}) / (1-\gamma)$. Bottom panels (C, D): $Q_{base} = \max(\mathbf{R}) / (1-\gamma)$. Left panels (A, C): $\gamma = 0$. Right panels (B, D): $\gamma = 0.8$. For each initialisation, five runs are executed. The trajectories from the same initialisation are grouped based on their final location in policy space (below or above the diagonal from (0,1) to (1,0)), and the mean of each group is plotted. Line thickness indicates the proportion of runs in each group. The colour gradient (purple to yellow) indicates time evolution. The red cross marks the fixed point of the FAQL/BQL model. Note that for $Q_{base} = 0$ and $\gamma = 0.8$, some trajectories initialised in the top right appear to converge to the metastable phase of mutual cooperation in the depicted time span of $1 \times 10^5$ steps. II: Vector fields of previous models. E: FAQL model in continuous time, defined by \ref{eq:FAQL_model}. F: BQL model in discrete time, defined by \ref{eq:BQL_model}. G: Stability analysis of the BQL model (see appendix \ref{sec:Appendix_BQL}). It has a unique symmetric fixed point $\boldsymbol{\pi}_* > 0$, depending on the temperature $T > 0$. All absolute eigenvalues of the Jacobian at $\pi^i_{C*}$ are below 1, indicating a stable node.
  • Figure 3: Projection of our 4D deterministic approximation model of independent Q-learning on the Prisoner's Dilemma, defined by \ref{eq:QLmodelNEW}, into 2D policy space for $T=1$, $\alpha=0.01$, and different values of $\gamma$ and $Q_{base}$. The colour gradient (purple to yellow) represents time evolution. The end point of each trajectory is indicated by a red cross. Top panels (A, B): $Q_{base} = \min(\mathbf{R}) / (1-\gamma)$. Bottom panels (C, D): $Q_{base} = \max(\mathbf{R}) / (1-\gamma)$. Left panels (A, C): $\gamma = 0$. Right panels (B, D): $\gamma = 0.8$. Note that in panel B, the trajectory initialised at $\pi^i_{C}(0) = 0.9$ eventually converges to the fixed point $\pi^i_{C*} \approx 0.227$, but only after $4 \times 10^7$ steps, far beyond the depicted $2 \times 10^6$ steps.
  • Figure 4: Stability analysis of our model, defined by \ref{eq:QLmodelNEW}, for $\alpha = 0.01$ and three different temperature values: $T=0.3$ (A), $T=1$ (B), and $T=10$ (C). The deterministic 4D system shares the same unique symmetric fixed point $\boldsymbol{\pi}_* = \boldsymbol{\pi}(\mathbf{Q}_*)$ in policy space as the 2D FAQL/BQL model (figure \ref{fig:2}). The first row shows the position of the 4D fixed point $\mathbf{Q}_*$, defined by \ref{eq:FixedPointQ}, in 2D policy space. Specifically, it illustrates how the projected equilibrium policy $\pi^i_{C*} := \pi^i_{C}(\mathbf{Q}_*)$ is not affected by the discount factor. The second row shows the absolute eigenvalues of the Jacobian matrix at the 4D fixed point $\mathbf{Q}_*$ as a function of $\gamma$, with the stability threshold ($|\lambda| = 1$) highlighted. It demonstrates that although the position of the fixed point in policy space remains unaffected by $\gamma$, its stability properties change. For instance, at $T=1$, the dynamics undergo a supercritical Neimark-Sacker bifurcation at $\gamma_{cr_1} \approx 0.75$. The third row provides schematic representations of the corresponding dynamical regimes for different ranges of $\gamma$, illustrating transitions between stability, oscillatory dynamics, and divergence.
  • Figure 5: Projection of 4D deterministic dynamics of independent Q-learning on the Prisoner's Dilemma, defined by \ref{eq:QLmodelNEW}, for $T=1$, $\alpha=0.01$ and different values of $\gamma$. Left panels (A, D, G): $\gamma = 0.7$. Middle panels (B, E, H): $\gamma = 0.8$. Right panels (C, F, I): $\gamma = 0.97$. All trajectories are initialised around the fixed point $Q$-values, defined by \ref{eq:FixedPointQ}: $Q_{base} = Q_{C*} + \Delta Q_*/2$. The colour gradient (purple to yellow) represents time evolution over $3 \times 10^4$ steps. The end point of each trajectory is indicated by a red cross. Top panels (A, B, C): Projection of 4D dynamics into 2D policy space. Middle panels (D, E, F): Projection into a 3D space defined by the basis vectors $\mathbf{q}_1 = (1, -1, 0, 0)$, $\mathbf{q}_2 = (0, 0, 1, -1)$, and $\mathbf{q}_3 = (1, 1, -1, -1)$. The first two dimensions represent the $\Delta Q^i$-values, while the third dimension captures the difference between agents. Bottom panels (G, H, I): Projection into the same 3D space, viewed from a different angle. For $\gamma = 0.7$ and $\gamma = 0.8$, only the last two-thirds of the time evolution are shown for clarity. For $\gamma = 0.7$, the unique fixed point $\pi^i_{C*}$ is a stable focus. For $\gamma = 0.8$, it is an unstable focus surrounded by a stable limit cycle for all asymmetric joint policies. For $\gamma = 0.97$, it is a saddle point, with stable eigenvectors projected onto the diagonal of the policy space and unstable eigenvectors directed perpendicular to it. The trajectory initialised at $\pi^i_C(0) = 0.9$ remains at mutual cooperation $(\pi^i_C \approx 1)$ within any finite number of steps feasible for computational simulation. Note however that the equations show that this is not a true fixed point and pure policies are prohibited due to $T>0$.
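
The captions above map initial policies to $Q$-values and project 4D $Q$-vectors onto the vectors $\mathbf{q}_1, \mathbf{q}_2, \mathbf{q}_3$. The sketch below shows one way both steps could be computed. The initialisation rule is an assumption inferred from the numbers quoted for Figure 1 (two actions, Boltzmann policy, $Q$-values placed symmetrically around $Q_{base}$), and the function names and the sign convention $\Delta Q = Q_C - Q_D$ are hypothetical.

```python
import numpy as np

def q_from_policy(pi_c, T=1.0, q_base=0.0):
    """Return (Q_C, Q_D) centred on q_base that yield cooperation prob. pi_c."""
    delta = T * np.log(pi_c / (1.0 - pi_c))   # Q_C - Q_D under a 2-action softmax
    return q_base + delta / 2.0, q_base - delta / 2.0

def policy_from_q(q_c, q_d, T=1.0):
    """Cooperation probability of a two-action Boltzmann policy."""
    return 1.0 / (1.0 + np.exp(-(q_c - q_d) / T))

# Rows are q_1, q_2, q_3 from the Figure 5 caption; the 4D ordering is
# (Q^1_C, Q^1_D, Q^2_C, Q^2_D).
BASIS = np.array([[1.0, -1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, -1.0],
                  [1.0, 1.0, -1.0, -1.0]])

def project(Q4):
    """Inner products of a 4D Q-vector with q_1, q_2, q_3 (not normalised)."""
    return BASIS @ np.asarray(Q4, dtype=float)

# Joint policy (0.9, 0.7) at T = 1 and Q_base = 0, as in the Figure 1 caption.
q1c, q1d = q_from_policy(0.9)
q2c, q2d = q_from_policy(0.7)
coords = project([q1c, q1d, q2c, q2d])   # (Delta Q^1, Delta Q^2, agent difference)
```

The vectors $\mathbf{q}_1, \mathbf{q}_2, \mathbf{q}_3$ are mutually orthogonal but not unit length, so the returned coordinates are unnormalised inner products; the third coordinate is zero whenever both agents share the same $Q_{base}$.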