Optimal Execution with Reinforcement Learning

Yadh Hafsi, Edoardo Vittori

TL;DR

This work tackles the optimal execution of large trades within a finite horizon by learning execution policies through reinforcement learning inside ABIDES, a multi-agent limit order book (LOB) simulator. It formulates the problem as a finite-horizon MDP and uses a Deep Q-Network (DQN) to learn strategies that balance implementation shortfall against market impact over a 30-minute window, with a decision taken every second. The approach combines strategic scheduling with tactical order placement by using rich LOB features and market impact that is generated endogenously by the simulated agents, and it outperforms the TWAP, Passive, and Random baselines with lower variance. The findings suggest a practical RL-based framework for real-world high-frequency execution that adapts to changing liquidity conditions and agent interactions; future work could explore more tractable simulators to broaden benchmarking.

Abstract

This study investigates the development of an optimal execution strategy through reinforcement learning, aiming to determine the most effective approach for traders to buy and sell inventory within a finite time horizon. Our proposed model leverages input features derived from the current state of the limit order book and operates at a high frequency to maximize control. To simulate this environment and overcome the limitations associated with relying on historical data, we utilize the multi-agent market simulator ABIDES, which provides a diverse range of depth levels within the limit order book. We present a custom MDP formulation followed by the results of our methodology and benchmark the performance against standard execution strategies. Results show that the reinforcement learning agent outperforms standard strategies and offers a practical foundation for real-world trading applications.
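
To make the benchmark and the cost measure named above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the textbook definitions: TWAP splits the parent order evenly over the decision steps, and implementation shortfall is the cost of the realized fills relative to executing the whole quantity at the arrival price. The function names, the 20,000-share parent order, and the sign convention (positive means a cost) are hypothetical choices made for illustration; only the 30-minute window and 1-second decision interval come from the summary.

```python
from typing import List, Tuple


def twap_schedule(total_shares: int, n_steps: int) -> List[int]:
    """Split total_shares as evenly as possible into n_steps child orders."""
    base, remainder = divmod(total_shares, n_steps)
    # The first `remainder` steps carry one extra share so the sum is exact.
    return [base + 1 if i < remainder else base for i in range(n_steps)]


def implementation_shortfall(
    arrival_price: float,
    fills: List[Tuple[float, int]],
    side: str = "sell",
) -> float:
    """Cost of the fills relative to trading everything at arrival_price.

    fills is a list of (price, quantity) pairs. A positive result is a cost
    (assumed sign convention, not taken from the paper).
    """
    executed_value = sum(price * qty for price, qty in fills)
    executed_qty = sum(qty for _, qty in fills)
    benchmark_value = arrival_price * executed_qty
    if side == "sell":
        # Selling below the arrival price loses money.
        return benchmark_value - executed_value
    # Buying above the arrival price loses money.
    return executed_value - benchmark_value


# 30-minute window with one decision per second (from the summary);
# the 20,000-share parent order is a hypothetical example size.
N_STEPS = 30 * 60
schedule = twap_schedule(20_000, N_STEPS)
```

A learned policy would replace the fixed `schedule` with quantities chosen at each step from the current LOB state, and the same shortfall function could then score both the policy and the TWAP baseline on identical simulated paths.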

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Representation of the Limit Order Book
  • Figure 2: Stylized representation of average permanent vs. temporary price impact (see harvey2021).
  • Figure 3: Sample paths of ask prices generated by ABIDES for different seeds.
  • Figure 4: Average episode reward for the DQN agent with an environment seed equal to 10 and learning rates equal to $10^{-2}$ (pink), $10^{-3}$ (red), and $10^{-4}$ (blue).
  • Figure 5: Implementation shortfall normalized by the initial order size distribution.
  • ...and 3 more figures