Analysis of Multiscale Reinforcement Q-Learning Algorithms for Mean Field Control Games

Andrea Angiuli; Jean-Pierre Fouque; Mathieu Laurière; Mengrui Zhang

TL;DR

The paper proves convergence of a model-free, three-timescale Q-learning algorithm for Mean Field Control Games (MFCG) with finite state and action spaces. The algorithm updates a representative agent’s value function, a local population distribution, and a global population distribution on distinct timescales. The three learning rates satisfy $\frac12<\omega^{\tilde{\mu}}<\omega^Q<\omega^{\mu}<1$, so the local distribution is updated fastest, the Q-table at an intermediate rate, and the global distribution slowest. The analysis extends Borkar’s two-timescale framework to three timescales and shows convergence in the idealized, synchronous stochastic-approximation, and asynchronous settings. A key theoretical device is a soft-min policy parameterized by $\phi$, which yields a convergent fixed point $(\mu^{*\phi}, Q^{*\phi}_{\mu^{*\phi}}, \tilde{\mu}^{*\phi})$ that approximates the true MFCG solution as $\phi\to\infty$ (with a positive action gap $\delta(\phi)$). The paper also gives a simple numerical example illustrating convergence and discusses the contraction and Lyapunov arguments needed for the three-timescale analysis, which is relevant to scalable RL in mixed cooperative-competitive mean-field settings.
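The two ingredients named above, the ordered learning rates and the soft-min policy, can be illustrated with a minimal Python sketch. It assumes, without confirmation from this summary, that each $\omega$ is the exponent of a polynomially decaying step size $1/(1+n)^{\omega}$ and that the soft-min weights are proportional to $\exp(-\phi\,Q)$. The exponent values, the function names, and the sample Q-row are hypothetical.

```python
import math

def learning_rate(n, omega):
    """Step size 1 / (1 + n)**omega (assumed polynomial decay)."""
    return 1.0 / (1.0 + n) ** omega

def soft_min(q_values, phi):
    """Soft-min distribution over actions: weight proportional to exp(-phi * q)."""
    q_min = min(q_values)  # shift by the minimum for numerical stability
    weights = [math.exp(-phi * (q - q_min)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical exponents with 1/2 < OMEGA_LOCAL < OMEGA_Q < OMEGA_GLOBAL < 1.
OMEGA_LOCAL, OMEGA_Q, OMEGA_GLOBAL = 0.55, 0.75, 0.95

if __name__ == "__main__":
    # Ratios of successive timescales shrink as n grows, so the slower
    # variables look quasi-static from the point of view of the faster ones.
    for n in (10, 1_000, 100_000):
        r_local = learning_rate(n, OMEGA_LOCAL)
        r_q = learning_rate(n, OMEGA_Q)
        r_global = learning_rate(n, OMEGA_GLOBAL)
        print(n, r_q / r_local, r_global / r_q)

    # As phi grows, the soft-min policy concentrates on the minimizing action.
    q_row = [1.0, 0.4, 0.9]  # hypothetical Q-values for one state
    for phi in (1.0, 10.0, 100.0):
        print(phi, [round(p, 4) for p in soft_min(q_row, phi)])
```

With these exponents the ratios `r_q / r_local` and `r_global / r_q` both tend to zero, which is the separation of timescales that a multi-timescale stochastic-approximation argument relies on.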

Abstract

Mean Field Control Games (MFCG), introduced in [Angiuli et al., 2022a], represent competitive games between a large number of large collaborative groups of agents, in the limit where both the number and the size of the groups tend to infinity. In this paper, we prove the convergence of a three-timescale Reinforcement Q-Learning (RL) algorithm to solve MFCG in a model-free approach from the point of view of representative agents. Our analysis uses a Q-table for finite state and action spaces, updated at each discrete time step over an infinite horizon. In [Angiuli et al., 2023], we proved convergence of two-timescale algorithms for MFG and MFC separately, highlighting the need to follow multiple population distributions in the MFC case. Here, we integrate this feature for MFCG, together with three rates of update decreasing to zero in the proper ratios. Our technique of proof uses a generalization to three timescales of the two-timescale analysis in [Borkar, 1997]. We give a simple example satisfying the various hypotheses made in the proof of convergence and illustrating the performance of the algorithm.
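To make the idea of one Q-table and two population distributions updated at three different rates concrete, here is a minimal Python sketch of a single-trajectory loop. It is not the paper's algorithm. The update forms (a move toward the Dirac mass at the new state for each distribution, and a standard discounted Q-learning target with a minimum over actions) are assumptions. The two-state toy model, its cost `toy_cost`, its dynamics `toy_step`, the epsilon-greedy exploration, and all numerical constants are hypothetical.

```python
import random

N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.5
OMEGA_LOCAL, OMEGA_Q, OMEGA_GLOBAL = 0.55, 0.75, 0.95  # hypothetical exponents

def rate(n, omega):
    """Assumed step size 1 / (1 + n)**omega."""
    return 1.0 / (1.0 + n) ** omega

def toy_cost(x, a, mu, mu_local):
    """Hypothetical running cost: crowd aversion to both distributions plus an action cost."""
    return mu[x] + mu_local[x] + 0.1 * a

def toy_step(x, a, rng):
    """Hypothetical dynamics: action 1 tries to switch state, action 0 tries to stay."""
    target = 1 - x if a == 1 else x
    return target if rng.random() < 0.8 else 1 - target

def move_toward_dirac(dist, x, step):
    """dist + step * (delta_x - dist); stays a probability vector for step in [0, 1]."""
    return [p + step * ((1.0 if i == x else 0.0) - p) for i, p in enumerate(dist)]

def run(n_steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    mu = [1.0 / N_STATES] * N_STATES        # global distribution (slow)
    mu_local = [1.0 / N_STATES] * N_STATES  # local distribution (fast)
    x = 0
    for n in range(n_steps):
        # Epsilon-greedy choice of a cost-minimizing action.
        if rng.random() < epsilon:
            a = rng.randrange(N_ACTIONS)
        else:
            a = min(range(N_ACTIONS), key=lambda b: q[x][b])
        x_next = toy_step(x, a, rng)
        # Fast timescale: local distribution.
        mu_local = move_toward_dirac(mu_local, x_next, rate(n, OMEGA_LOCAL))
        # Intermediate timescale: Q-table entry for the visited pair.
        td_target = toy_cost(x, a, mu, mu_local) + GAMMA * min(q[x_next])
        q[x][a] += rate(n, OMEGA_Q) * (td_target - q[x][a])
        # Slow timescale: global distribution.
        mu = move_toward_dirac(mu, x_next, rate(n, OMEGA_GLOBAL))
        x = x_next
    return q, mu, mu_local

if __name__ == "__main__":
    q, mu, mu_local = run()
    print("Q:", q)
    print("global mu:", mu)
    print("local mu:", mu_local)
```

The sketch uses one global step counter for all three rates. A full asynchronous scheme would typically track per-entry visit counts for the Q-table, which is omitted here for brevity.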

Paper Structure (24 sections, 14 theorems, 89 equations, 2 figures, 1 algorithm)

Introduction
Background
Structure of the Paper
Model and MFCG Formulation
Model and Notations
Classical Q-learning.
Notations.
MFCG Definition
Q-learning for MFCG
Algorithms and Multiscale Learning Rates
Full Algorithm
Synchronous Algorithm with Stochastic Approximation
Idealized Three-timescale Algorithm
Convergence: Three-timescale Approach
Convergence of the Idealized Three-timescale Algorithm
...and 9 more sections

Key Result

Proposition 4.2

If Assumption mfcglip holds, then the function $(\mu,Q,\tilde{\mu})\mapsto\mathcal{P}_3(\mu,Q,\tilde{\mu})$ is Lipschitz and more precisely: and

Figures (2)

Figure 1: MFCG setting: Convergence of the distribution. The plot represents the value of $\mu_n(x_0)$ and $\mu^{\alpha^*}_n(x_0)$ as a function of $n$.
Figure 2: MFCG setting: Convergence of the distribution. The plot represents the value of $\mu_n(x_0)$ and $\mu^{\alpha^*}_n(x_0)$ as a function of $n$.

Theorems & Definitions (32)

Definition 2.1
Proposition 4.2
proof
Proposition 4.3
proof
Proposition 4.4
proof
Proposition 4.6
proof
Theorem 4.10
...and 22 more
