Continuous-time q-Learning for Jump-Diffusion Models under Tsallis Entropy

Lijun Bo; Yijie Huang; Xiang Yu; Tingting Zhang

TL;DR

This paper studies continuous-time reinforcement learning in jump-diffusion models via q-learning (the continuous-time counterpart of Q-learning) under Tsallis entropy regularization, and finds that the optimal policies under this regularization can be characterized explicitly.

Abstract

This paper studies continuous-time reinforcement learning in jump-diffusion models by developing q-learning (the continuous-time counterpart of Q-learning) under Tsallis entropy regularization. In contrast to Shannon entropy, the general form of Tsallis entropy means that the optimal policy is not necessarily a Gibbs measure. A Lagrange multiplier and the KKT conditions are therefore needed to ensure that the learned policy is a probability density function, and, as a consequence, the characterization of the optimal policy in terms of the q-function also involves a Lagrange multiplier. In response, we establish a martingale characterization of the q-function and devise two q-learning algorithms, depending on whether the Lagrange multiplier can be derived explicitly. When it cannot, we consider different parameterizations of the optimal q-function and the optimal policy, and update them alternately in an actor-critic manner. We also study two numerical examples: an optimal liquidation problem in dark pools and a non-LQ control problem. In both, the optimal policies under Tsallis entropy regularization can be characterized explicitly as distributions concentrated on a compact support. The satisfactory performance of our q-learning algorithms is illustrated in each example.
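
To make the role of the Lagrange multiplier concrete, the following is a minimal numerical sketch, not the paper's algorithm. It assumes a commonly used Tsallis entropy of index $p>1$, namely $\frac{1}{p-1}\int(\pi-\pi^p)\,du$, with temperature $\gamma$; under that assumption the KKT conditions give a policy of the form $\pi(u)=\big[\tfrac{p-1}{p\gamma}(q(u)-\psi)+\tfrac{1}{p}\big]_+^{1/(p-1)}$, where the multiplier $\psi$ is fixed by $\int\pi\,du=1$. The function name `tsallis_policy`, the quadratic q-function, and the one-dimensional action grid are all hypothetical choices made for illustration.

```python
import numpy as np

def tsallis_policy(q_vals, du, gamma=1.0, p=2.0, n_iter=200):
    """Grid density maximizing sum(q*pi)*du + gamma/(p-1)*sum(pi - pi**p)*du
    subject to pi >= 0 and sum(pi)*du == 1 (assumed Tsallis form, p > 1)."""
    coef = (p - 1.0) / (p * gamma)

    def density(psi):
        # KKT form: the positive part enforces pi >= 0.
        base = coef * (q_vals - psi) + 1.0 / p
        return np.maximum(base, 0.0) ** (1.0 / (p - 1.0))

    def mass(psi):
        # Riemann-sum approximation of the total probability mass.
        return density(psi).sum() * du

    # The mass is non-increasing in psi; at this upper bound it is exactly zero.
    hi = q_vals.max() + gamma / (p - 1.0)
    step = 1.0
    lo = hi - step
    while mass(lo) < 1.0:  # expand the bracket until the mass reaches 1
        step *= 2.0
        lo = hi - step

    # Bisection on the Lagrange multiplier psi so that the mass equals 1.
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    psi = 0.5 * (lo + hi)
    return density(psi), psi

# Hypothetical example: a concave quadratic q-function on a 1-D action grid.
u = np.linspace(-3.0, 3.0, 6001)
du = u[1] - u[0]
q_vals = -5.0 * (u - 0.5) ** 2
pi, psi = tsallis_policy(q_vals, du, gamma=1.0, p=2.0)
support = u[pi > 0.0]
print(support.min(), support.max(), pi.sum() * du)
```

In this example the density vanishes outside a strict sub-interval of the grid, which illustrates the compact-support behaviour described in the abstract. Bisection stands in for the case where the multiplier is not available in closed form; when it is, `psi` can be evaluated directly.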

Paper Structure (13 sections, 12 theorems, 139 equations, 7 figures, 3 tables, 2 algorithms)

Introduction
Problem Formulation
Exploratory formulation in reinforcement learning
Weak convergence of sampled dynamics to relaxed version
Exploratory HJB equation and policy improvement iteration
Continuous-time q-Function and Martingale Characterization under Tsallis Entropy
q-Learning Algorithms under Tsallis Entropy
q-Learning algorithm when the normalizing function is available
q-Learning algorithm when the normalizing function is unavailable
Applications and Numerical Examples
The optimal portfolio liquidation problem
A non-LQ optimal repo rate control problem
Conclusions

Key Result

Lemma 2.1

Let Assumptions (A$_{1}$) and (A$_{2}$) hold. Consider $(t,x)\in[0,T]\times\mathds{R}^n$ and $\pi\in \Pi_t$, and let $(\Omega', \mathcal{F}',\mathbb{F}', \mathbb{Q})$ be the probability space specified in \eqref{eq:space}. Then, we have

Figures (7)

Figure 1: (a) The optimal policy $(u_1,u_2)\to \hat{\pi}(u_1,u_2)$ with $p=1$. (b) The optimal policy $(u_1,u_2)\to \hat{\pi}(u_1,u_2)$ with $p=2$. (c) The optimal policy $(u_1,u_2)\to \hat{\pi}(u_1,u_2)$ with $p=3$. The model parameters are set to $\lambda=1,~\kappa=1,~c=1,~\gamma=1,~t=1,~T=2,~x=5$.
Figure 2: (a) The compact support set of $u_1$. (b) The compact support set of $u_2$. The model parameters are set to $\lambda=1,~\kappa=1,~c=1,~\gamma=0.001,~t=1,~T=2,~x=5$.
Figure 3: (a) The compact support set of $u_1$. (b) The compact support set of $u_2$. The model parameters are set to $\lambda=1,~\kappa=1,~c=1,~\gamma=0.1,~t=1,~T=2,~x=5$.
Figure 4: Convergence of Algorithm \ref{Alg:Tsallis-q-Learning} using a market simulator. The upper panels show the convergence of the parameter iterations for $(\theta_1,\theta_2,\theta_3,\theta_4,\theta_5,\zeta_1,\zeta_2,\zeta_3,\zeta_4,\zeta_5,\zeta_6)$; the bottom panel shows the value error along the iterations.
Figure 5: The optimal policy $(u_1,u_2)\to \widehat{\pi}(u_1,u_2)$. The model parameters are set to $\lambda=1,~\sigma=1,~\nu=0.5,~A=B=1,~\mu_1=\mu_2=0.5,~h=1.5,~\gamma=1,~t=1,~T=2,~x=1$.
...and 2 more figures

Theorems & Definitions (28)

Definition 2.1
Lemma 2.1
proof
Remark 2.2
Theorem 2.3
proof
Corollary 2.4
Remark 2.5
Theorem 2.6: Policy Improvement Iteration
Lemma 2.7
...and 18 more
