A non-zero-sum game with reinforcement learning under mean-variance framework

Junyi Guo, Xia Han, Hao Wang, Kam Chuen Yuen

TL;DR

This work tackles a two-agent competitive portfolio problem under a time-inconsistent mean-variance criterion in an incomplete market. It develops a dynamic-programming–based, time-consistent Nash equilibrium using a verification framework and Choquet regularizers to model exploration, with explicit solutions in a Gaussian mean-return setting. The authors formulate a policy-iteration and actor–critic RL algorithm that learns the equilibrium, proving uniform convergence in the time-inconsistent context and demonstrating robustness through numerical experiments. The approach offers a practical pathway to compute and implement equilibrium strategies in multi-agent financial environments with controllable exploration.

Abstract

In this paper, we investigate a competitive market involving two agents who consider both their own wealth and the wealth gap with their opponent. Both agents can invest in a financial market consisting of a risk-free asset and a risky asset, under conditions where model parameters are partially or completely unknown. This setup gives rise to a non-zero-sum differential game within the framework of reinforcement learning (RL). Each agent aims to maximize his own Choquet-regularized, time-inconsistent mean-variance objective. Adopting the dynamic programming approach, we derive a time-consistent Nash equilibrium strategy in a general incomplete market setting. Under the additional assumption of a Gaussian mean return model, we obtain an explicit analytical solution, which facilitates the development of a practical RL algorithm. Notably, the proposed algorithm achieves uniform convergence, even though the conventional policy improvement theorem does not apply to the equilibrium policy. Numerical experiments demonstrate the robustness and effectiveness of the algorithm, underscoring its potential for practical implementation.
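
To make the objective described above concrete, here is a minimal Monte Carlo sketch and not the authors' algorithm. It rests on several assumptions that are not stated in the abstract: the wealth-gap term is taken to be own wealth minus `k` times the opponent's wealth, the mean-variance criterion is taken to be the mean minus `gamma / 2` times the variance, both agents trade the same risky asset with constant, known parameters, and exploration is modelled by sampling each action from a Gaussian. The Choquet regularizer is omitted entirely, and all function names and parameter values are hypothetical.

```python
import random
import statistics


def relative_wealth(x_own, x_opp, k):
    """Own wealth minus k times the opponent's wealth (assumed form of the wealth-gap term)."""
    return x_own - k * x_opp


def mean_variance(samples, gamma):
    """Sample mean minus (gamma / 2) times the population variance of the samples."""
    return statistics.fmean(samples) - 0.5 * gamma * statistics.pvariance(samples)


def simulate_pair(policies, x0, mu, sigma, r, T, steps, rng):
    """Euler scheme for both agents' wealth, driven by one shared risky asset.

    policies[i] = (mean, std) of the Gaussian from which agent i samples the
    dollar amount held in the risky asset at each step (assumed exploration model).
    """
    dt = T / steps
    x = list(x0)
    for _ in range(steps):
        dw = rng.gauss(0.0, dt ** 0.5)  # Brownian increment shared by both agents
        for i, (mean_u, std_u) in enumerate(policies):
            u = rng.gauss(mean_u, std_u)  # randomized (exploratory) action
            x[i] += (r * x[i] + (mu - r) * u) * dt + sigma * u * dw
    return x


def estimate_objectives(policies, k, gamma, n_paths=2000, seed=0):
    """Estimate each agent's unregularized mean-variance objective by simulation."""
    rng = random.Random(seed)
    # Illustrative market parameters; the source treats them as unknown to the agents.
    terminal = [
        simulate_pair(policies, (1.0, 1.0), mu=0.08, sigma=0.2, r=0.02,
                      T=1.0, steps=50, rng=rng)
        for _ in range(n_paths)
    ]
    objectives = []
    for i in (0, 1):
        j = 1 - i
        gaps = [relative_wealth(x[i], x[j], k[i]) for x in terminal]
        objectives.append(mean_variance(gaps, gamma[i]))
    return objectives


if __name__ == "__main__":
    values = estimate_objectives(
        policies=[(0.5, 0.1), (0.3, 0.1)], k=(0.5, 0.5), gamma=(2.0, 2.0)
    )
    print(values)
```

Because each agent's objective depends on the opponent's policy through the wealth gap, changing one entry of `policies` changes both estimates, which is the coupling that makes the problem a game and not two separate portfolio problems.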

Paper Structure

This paper contains 12 sections, 8 theorems, 95 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

For $i,j\in\{1,2\}$ and $i\neq j$, fix $\Pi_j$ and suppose that functions $W_i(t,x,y)\in C^{1,2,2}(\mathcal{D})$, $g_i(t,x,y)\in C^{1,2,2}(\mathcal{D})$ and strategy $\Pi_i$ satisfy the following properties: Then $\Pi_i$ is the equilibrium response of Agent $i$. Furthermore, $W_i(t,\hat{x}_i,y)=J_i(t,\hat{x}_i,y;\Pi_i,\Pi_j)$ is the equilibrium response value function of Agent $i$ and $g_i(t,\hat

Figures (2)

  • Figure 1: The effects of $t$, $k_1$, $k_2$, $\gamma_1$, and $\gamma_2$ on the Nash equilibrium
  • Figure 2: The mean value of Nash equilibrium

Theorems & Definitions (17)

  • Example 1
  • Remark 1
  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1: Verification theorem
  • Lemma 1: Theorem 3.1 of LCLW20
  • Proposition 1
  • proof
  • Theorem 2
  • ...and 7 more