Two Kinds of Learning Algorithms for Continuous-Time VWAP Targeting Execution

Xingyu Zhou, Wenbin Chen, Mingyu Xu

Abstract

Optimal execution is a long-standing research topic, and many reinforcement learning (RL) algorithms have been studied for it. In this article, we consider the execution problem of targeting the volume-weighted average price (VWAP) and propose a relaxed stochastic optimization problem with an entropy regularizer to encourage exploration. We derive an explicit formula for the optimal policy, which is Gaussian, with its mean being the solution to the original problem. Extending the framework of continuous-time RL to processes with jumps, we provide theoretical proofs for two RL algorithms: the first minimizes the martingale loss function, which yields the optimal parameter estimates in the mean-square sense, and the second uses the martingale orthogonality condition. In addition to the RL algorithms, we propose another learning algorithm, adaptive dynamic programming (ADP), and evaluate all of them in two different environments across different random seeds. Convergence of all algorithms is verified in both environments, with a larger advantage in the environment with stronger price impact. ADP is a good choice when the agent fully understands the environment and can estimate the parameters well. RL algorithms, on the other hand, require no model assumptions or parameter estimation and learn directly from interactions with the environment.
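The martingale loss mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it discretizes one common form of the martingale loss from the continuous-time RL literature, $\frac{1}{2}\,\mathbb{E}\int_0^T \big(h(X_T) + \int_t^T r_s\,ds - J^\theta(t, X_t)\big)^2 dt$, on a uniform time grid. The function name `martingale_loss` and the array arguments `values`, `rewards`, and `terminal` are hypothetical, and the paper's jump-diffusion dynamics and value-function parameterization are not reproduced here.

```python
import numpy as np


def martingale_loss(values, rewards, terminal, dt):
    """Discretized martingale loss over a batch of simulated trajectories.

    values:   (n_paths, n_steps) array, approximate value J_theta(t_k, X_{t_k})
    rewards:  (n_paths, n_steps) array, running reward rate at each grid point
    terminal: (n_paths,) array, terminal payoff h(X_T) of each path
    dt:       grid step size
    (All names are illustrative assumptions, not taken from the paper.)
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    terminal = np.asarray(terminal, dtype=float)

    # Reward-to-go from each grid point t_k: sum_{j >= k} r_j * dt + h(X_T).
    # Reversing, cumulatively summing, and reversing back gives the tail sums.
    reward_to_go = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1] * dt
    reward_to_go = reward_to_go + terminal[:, None]

    # Residual between the realized reward-to-go and the value estimate.
    residual = reward_to_go - values

    # Riemann sum over time, Monte Carlo average over paths, factor 1/2.
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1) * dt)
```

In this sketch, a value estimate that matches the realized reward-to-go on every path gives a loss of zero, and any mismatch is penalized quadratically. In practice the loss would be minimized over the parameters of `values` with a gradient-based optimizer.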

Paper Structure

This paper contains 18 sections, 6 theorems, 88 equations, 5 figures, 4 tables, 3 algorithms.

Key Result

Theorem 1

If Assumption 1 holds, then for any admissible policy $\pi$, (ex_equation_vec) admits a unique strong solution. Furthermore, if $p\geq2$, then there exists $C=C(p)$ satisfying And the value function (value_V) is finite.

Figures (5)

  • Figure 1: Ten samples of market trading speed process (left) and stock price process (right).
  • Figure 2: Training curves in Environments 1 and 2. The different colors represent training results under different random seeds, and the yellow dashed line indicates the return gained by the optimal policy. The top five panels (Figures (a)-(e)) show the training results in Environment 1, and the bottom five panels (Figures (f)-(j)) show the training results in Environment 2. The solid curves indicate the mean value of the five out-of-sample tests at the end of each training session, and the shading covers the area between the minimum and maximum values of the five tests.
  • Figure 3: MSE between the mean value and the optimal policy in Environments 1 and 2. The top five panels (Figures (a)-(e)) show the training results in Environment 1, and the bottom five panels (Figures (f)-(j)) show the training results in Environment 2. The different colors represent training results under different random seeds.
  • Figure 4: The optimal exploratory policy in Environment 1. The shade of colors represents the likelihood of taking the corresponding action at that moment, with darker colors representing greater probability.
  • Figure 5: The optimal exploratory policy in Environment 2. The shade of colors represents the likelihood of taking the corresponding action at that moment, with darker colors representing greater probability.

Theorems & Definitions (13)

  • Definition 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Proposition 1
  • proof
  • Theorem 3
  • proof
  • Proposition 2
  • ...and 3 more