Continuous-time q-Learning for Jump-Diffusion Models under Tsallis Entropy

Lijun Bo, Yijie Huang, Xiang Yu, Tingting Zhang

TL;DR

This paper studies continuous-time reinforcement learning in jump-diffusion models via q-learning (the continuous-time counterpart of Q-learning) under Tsallis entropy regularization, and finds that the optimal policies under this regularization can be characterized explicitly.

Abstract

This paper studies continuous-time reinforcement learning in jump-diffusion models via q-learning (the continuous-time counterpart of Q-learning) under Tsallis entropy regularization. In contrast to Shannon entropy, the general form of Tsallis entropy means that the optimal policy is not necessarily a Gibbs measure. A Lagrange multiplier and the KKT conditions are therefore needed to ensure that the learned policy is a probability density function, and as a consequence the characterization of the optimal policy in terms of the q-function also involves a Lagrange multiplier. In response, we establish the martingale characterization of the q-function and devise two q-learning algorithms, depending on whether or not the Lagrange multiplier can be derived explicitly. When it cannot, we consider different parameterizations of the optimal q-function and the optimal policy, and update them alternately in an Actor-Critic manner. We also study two numerical examples: an optimal liquidation problem in dark pools and a non-LQ control problem. In these examples, the optimal policies under Tsallis entropy regularization can be characterized explicitly and are distributions concentrated on a compact support. The satisfactory performance of our q-learning algorithms is illustrated in each example.
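
To make the role of the Lagrange multiplier concrete, here is a minimal numerical sketch, not the paper's algorithm. It assumes the common Tsallis entropy form $\frac{1}{p-1}\int(\pi-\pi^p)\,du$ with $p>1$, a one-dimensional action grid, and a known function $q(u)$; under these assumptions the KKT conditions give $\pi(u)=\big(\tfrac{p-1}{p\gamma}(q(u)+\psi)_+\big)^{1/(p-1)}$, where the multiplier $\psi$ is fixed by normalization. The function name `tsallis_policy`, the grid, and the use of bisection to find $\psi$ are illustrative choices made here.

```python
import numpy as np

def tsallis_policy(q_values, du, gamma=1.0, p=2.0, tol=1e-10, max_iter=200):
    """Tsallis-regularized density on a uniform 1-D action grid (sketch, p > 1).

    Assumed form: pi(u) = ((p-1)/(p*gamma) * max(q(u) + psi, 0)) ** (1/(p-1)),
    with the multiplier psi chosen so that sum(pi) * du = 1.
    """
    if p <= 1.0:
        raise ValueError("this sketch covers p > 1 only")
    q = np.asarray(q_values, dtype=float)
    scale = (p - 1.0) / (p * gamma)
    expo = 1.0 / (p - 1.0)

    def density(psi):
        return (scale * np.maximum(q + psi, 0.0)) ** expo

    def mass(psi):
        return density(psi).sum() * du

    # The total mass is nondecreasing in psi, so bracket the root and bisect.
    lo = -q.max()            # here q + psi <= 0 everywhere, so mass is 0
    hi = lo + 1.0
    while mass(hi) < 1.0:    # widen the bracket until the mass reaches 1
        hi = lo + 2.0 * (hi - lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    psi = 0.5 * (lo + hi)
    return density(psi), psi

# Hypothetical example: a concave q-function on the grid [-3, 3].
u = np.linspace(-3.0, 3.0, 601)
du = u[1] - u[0]
pi, psi = tsallis_policy(-u**2, du, gamma=1.0, p=2.0)
support = u[pi > 0.0]        # grid points carrying positive density
```

For $p=2$, $\gamma=1$ and $q(u)=-u^2$, the continuum solution is the truncated parabola $\pi(u)=\tfrac12(\psi-u^2)_+$ with $\psi=(3/2)^{2/3}$, so the computed density vanishes outside a bounded interval. This is the compact-support behavior described in the abstract, in contrast to a Gibbs density, which is positive everywhere.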

Paper Structure (13 sections, 12 theorems, 139 equations, 7 figures, 3 tables, 2 algorithms)

Key Result

Lemma 2.1

Let Assumptions (A$_{1}$) and (A$_{2}$) hold. Consider $(t,x)\in[0,T]\times\mathds{R}^n$, $\pi\in \Pi_t$ and $(\Omega', \mathcal{F}',\mathbb{F} ', \mathbb{Q} )$ which is the probability space specified in eq:space. Then, we have

Figures (7)

  • Figure 1: (a) The optimal policy $(u_1,u_2)\to \hat{\pi}(u_1,u_2)$ with $p=1$. (b) The optimal policy $(u_1,u_2)\to \hat{\pi}(u_1,u_2)$ with $p=2$. (c) The optimal policy $(u_1,u_2)\to \hat{\pi}(u_1,u_2)$ with $p=3$. The model parameters are set to be $\lambda=1,~\kappa=1,~c=1,~\gamma=1,~t=1,~T=2,~x=5$.
  • Figure 2: (a) The compact support set of $u_1$. (b) The compact support set of $u_2$. The model parameters are set to be $\lambda=1,~\kappa=1,~c=1,~\gamma=0.001,~t=1,~T=2,~x=5$.
  • Figure 3: (a) The compact support set of $u_1$. (b) The compact support set of $u_2$. The model parameters are set to be $\lambda=1,~\kappa=1,~c=1,~\gamma=0.1,~t=1,~T=2,~x=5$.
  • Figure 4: Convergence of Algorithm \ref{Alg:Tsallis-q-Learning} using a market simulator. The upper panels show the convergence of parameter iterations for $(\theta_1,\theta_2,\theta_3,\theta_4,\theta_5,\zeta_1,\zeta_2,\zeta_3,\zeta_4,\zeta_5,\zeta_6)$; the bottom panel shows the value error along the iterations.
  • Figure 5: The optimal policy $(u_1,u_2)\to \widehat{\pi}(u_1,u_2)$. The model parameters are set to be $\lambda=1,~\sigma=1,~\nu=0.5,~A=B=1,~\mu_1=\mu_2=0.5,~h=1.5,~\gamma=1,~t=1,~T=2,~x=1$.
  • ...and 2 more figures

Theorems & Definitions (28)

  • Definition 2.1
  • Lemma 2.1
  • proof
  • Remark 2.2
  • Theorem 2.3
  • proof
  • Corollary 2.4
  • Remark 2.5
  • Theorem 2.6: Policy Improvement Iteration
  • Lemma 2.7
  • ...and 18 more