Continuous-time q-learning for mean-field control problems

Xiaoli Wei; Xiang Yu

TL;DR

Two distinct q-functions naturally arise in mean-field control problems, and they are related via an integral representation.

Abstract

This paper studies q-learning, recently coined as the continuous-time counterpart of Q-learning by Jia and Zhou (2023), for continuous-time McKean-Vlasov control problems in the setting of entropy-regularized reinforcement learning. In contrast to the single agent's control problem in Jia and Zhou (2023), the mean-field interaction of agents makes the definition of the q-function more subtle, and we reveal that two distinct q-functions naturally arise: (i) the integrated q-function (denoted by $q$), the first-order approximation of the integrated Q-function introduced in Gu, Guo, Wei and Xu (2023), which can be learnt by a weak martingale condition involving test policies; and (ii) the essential q-function (denoted by $q_e$), which is employed in the policy improvement iterations. We show that the two q-functions are related via an integral representation under all test policies. Based on the weak martingale condition and our proposed method for searching test policies, we devise several model-free learning algorithms. In two examples, one within the LQ control framework and one beyond it, we obtain the exact parameterization of the optimal value function and q-functions and illustrate our algorithms with simulation experiments.

Paper Structure (14 sections, 7 theorems, 114 equations, 2 figures, 2 tables, 2 algorithms)

Introduction
Problem Formulation
Strong Control Formulation
Exploratory Formulation
q-Functions for Continuous Time Mean-field Control
Soft Q-learning for Mean-field Control
Two Continuous Time q-functions
Weak Martingale Characterizations
Algorithms under Continuous Time q-Learning
Financial Applications
Mean-Variance Portfolio Optimization
Mean-Field Optimal Consumption Problem
Conclusion
Connection between formulations (\ref{equ:exploratory_SDE}) and (\ref{equ:exploratory_average_SDE})

Key Result

Theorem 2.2

For any given ${\bm \pi} \in \Pi$, define ${\bm \pi}' = \mathcal{I}({\bm \pi})$, with $\mathcal{I}$ given in (equ:policy_improvemet_map). Then ${\bm \pi}'$ improves upon ${\bm \pi}$. Moreover, if the map $\mathcal{I}$ in (equ:policy_improvemet_map) has a fixed point ${\bm \pi}^* \in \Pi$, then ${\bm \pi}^*$ is the optimal policy of (equ:optimal_value_function).
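The explicit form of $\mathcal{I}$ is not reproduced on this page. In entropy-regularized reinforcement learning, an improvement step of this kind commonly takes a Gibbs (Boltzmann) form, with the new policy's density proportional to $\exp(q/\gamma)$ for a temperature $\gamma > 0$. The sketch below assumes that form and is not the paper's algorithm: the function name `gibbs_policy`, the quadratic stand-in for the q-function, and the parameters `c`, `m`, `gamma` are all hypothetical. It checks numerically that a quadratic q-function in the action yields a Gaussian policy.

```python
import numpy as np

def gibbs_policy(q_values, gamma, da):
    """Density proportional to exp(q / gamma) on a uniform action grid.

    Assumed Gibbs form of the improvement step; illustrative only.
    """
    z = (q_values - q_values.max()) / gamma  # shift for numerical stability
    w = np.exp(z)
    return w / (w.sum() * da)  # normalize so the Riemann sum equals 1

# Hypothetical quadratic q-function in the action: q(a) = -0.5 * c * (a - m)^2
c, m, gamma = 2.0, 0.3, 0.5
actions = np.linspace(-10.0, 10.0, 20001)
da = actions[1] - actions[0]
density = gibbs_policy(-0.5 * c * (actions - m) ** 2, gamma, da)

# Moments of the resulting policy, computed by Riemann sums on the grid
mean = (actions * density).sum() * da
var = ((actions - mean) ** 2 * density).sum() * da
```

Under these assumptions the resulting density is Gaussian with mean `m` and variance `gamma / c`, so a larger temperature gives a more exploratory policy.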

Figures (2)

Figure 1: Convergence of value function and q-function for Algorithm \ref{algo:offline episodic ml-method1}. Paths of learnt parameters for value function (top left) and paths of learnt parameters for q-function (top right) vs optimal parameters shown in the dashed line. The change of the weak martingale loss value over iterations (bottom left) and the change of $L^2$-error over iterations (bottom right) along a trajectory $(\bar{\mu}_{t_k}, {\rm Var}(\mu)_{t_k})_k$ with $\bar{\mu}_0 =0$ and ${\rm Var}(\mu)_0 =0.5$ controlled by the learnt policy and by the optimal policy.
Figure 2: Convergence of value function and q-function for Algorithm \ref{algo:offline episodic ml-method1}. Paths of learnt parameters for value function (top left) and paths of learnt parameters for q-function (top right) vs optimal parameters shown in the dashed line. The change of the weak martingale loss value over iterations (bottom left) and the change of $L^2$-error over iterations (bottom right) along a trajectory $(\log (\bar{\mu}_{t_k}))_k$ with $\log(\bar{\mu}_0) =0$ controlled by the learnt policy and by the optimal policy.

Theorems & Definitions (19)

Theorem 2.2: Policy improvement result
Lemma 2.3
Definition 3.1
Remark 3.2
Definition 3.3
Lemma 3.4
Remark 3.5
Theorem 4.1: Characterization of the integrated q-function
Remark 4.2
Theorem 4.3: Characterization of the value function and the integrated q-function
...and 9 more
