Deep Reinforcement Learning for Online Optimal Execution Strategies

Alessandro Micheli, Mélodie Monod

TL;DR

A novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG) is introduced to address the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets, with a focus on transient price impact modeled by a general decay kernel.

Abstract

This paper tackles the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets. We introduce a novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG) to address this issue, with a focus on transient price impact modeled by a general decay kernel. Through numerical experiments with various decay kernels, we show that our algorithm successfully approximates the optimal execution strategy. Additionally, the proposed algorithm demonstrates adaptability to evolving market conditions, where parameters fluctuate over time. Our findings also show that modern reinforcement learning algorithms can provide a solution that reduces the need for frequent and inefficient human intervention in optimal execution tasks.
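To illustrate why transient price impact makes the execution problem non-Markovian, the following is a minimal sketch of a discrete-time impact model in which each past trade displaces the price by an amount that decays according to a kernel $G$. The three kernel forms, their parameter values, and the function names (`exponential_kernel`, `power_law_kernel`, `linear_resilience_kernel`, `transient_impact`) are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Illustrative decay kernels G(t); functional forms and defaults are assumptions.
def exponential_kernel(t, rho=1.0):
    return np.exp(-rho * t)

def power_law_kernel(t, gamma=0.5):
    return (1.0 + t) ** (-gamma)

def linear_resilience_kernel(t, rho=0.5):
    # Impact decays linearly and is floored at zero.
    return np.maximum(1.0 - rho * t, 0.0)

def transient_impact(trades, times, kernel):
    """Price displacement felt just before each trade.

    Entry i is the sum over strictly earlier trades j of
    kernel(times[i] - times[j]) * trades[j], so it depends on the
    whole trading history rather than on the current inventory alone.
    """
    trades = np.asarray(trades, dtype=float)
    times = np.asarray(times, dtype=float)
    lags = times[:, None] - times[None, :]
    # Only strictly earlier trades contribute; clip keeps kernel inputs valid.
    weights = np.where(lags > 0, kernel(np.clip(lags, 0.0, None)), 0.0)
    return weights @ trades

if __name__ == "__main__":
    times = [0.0, 1.0, 2.0]
    trades = [1.0, 1.0, 1.0]
    for kernel in (exponential_kernel, power_law_kernel, linear_resilience_kernel):
        print(kernel.__name__, transient_impact(trades, times, kernel))
```

Because the displacement at each step depends on every earlier trade, a state that records only time and remaining inventory does not summarise the dynamics, which is the sense in which the optimal strategy is non-Markovian.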

Paper Structure

This paper contains 15 sections, 2 theorems, 25 equations, 3 figures, 1 algorithm.

Key Result

Proposition 3.1

Let $\mathcal{Q}^{*}$ be the optimal auxiliary Q-function. Then, $\mathcal{Q}^{*}$ satisfies the following Bellman equation with $s'\sim P(s' \, | \, s ,a)$ and for any $s\in\mathcal{S}$ and $a\in\mathcal{A}(s)$.

Figures (3)

  • Figure 1: Optimal Execution Strategies for Exponential Kernel, Power Law Kernel and Linear Resilience Kernel. Blue bars represent the optimal strategy estimated by the DDPG agent and red bars represent the true optimal strategy (Equation \ref{eq:optimal_strategy}).
  • Figure 2: Advantage of Auxiliary Q-function. Episode reward of the true optimal strategy (red), the optimal strategy estimated by the DDPG agent trained with the auxiliary Q-function (Equation \ref{eq-auxiliary-bellman}) (blue), and the one trained with the standard Q-function (Equation \ref{eq-bellman-equation}) (green). The experiment design is the same as that explained in Section \ref{sec:convergence_experiment} for the exponential kernel.
  • Figure 3: Online Learning in a Dynamic Environment. (a) Exponential kernel decay parameter in a dynamic environment for two scenarios. (b) Episode reward of the estimated optimal strategy (dark blue) and the executed strategy (light blue) of the DDPG agent relative to the episode reward of the true optimal strategy (Equation \ref{eq:optimal_strategy}) (red).

Theorems & Definitions (5)

  • Definition 2.1
  • Remark 2.3
  • Proposition 3.1: Auxiliary Bellman Equation
  • Proposition 3.2: Auxiliary Policy Gradient
  • Proof of Proposition \ref{prop-q-learning}