RL-Exec: Impact-Aware Reinforcement Learning for Opportunistic Optimal Liquidation, Outperforms TWAP and a Book-Liquidity VWAP on BTC-USD Replays

Enzo Duflot, Stanislas Robineau

TL;DR

RL-Exec tackles opportunistic sell-side liquidation in BTC-USD by integrating an impact-aware reinforcement learning framework with a replay-based environment that models transient impact, resilience, latency, and fees. A PPO agent learns from depth-20 LOB features and microstructure indicators, achieving statistically significant outperformance over TWAP and a book-liquidity VWAP proxy under a strict per-day evaluation, with gains growing with execution horizon ($H \in \{1800, 3600, 7200\}$ s). The study provides a rigorous, reproducible workflow including deterministic inference and per-day paired statistical tests (Wilcoxon with BH-FDR and bootstrap CIs), and demonstrates robust gains across horizons despite higher within-day variability. While limited to a single asset/venue and a parametric impact model, the results suggest practical potential for improved crypto execution, with clear directions for extending to buys, multi-venue routing, rolling walk-forward analyses, and live deployments.

Abstract

We study opportunistic optimal liquidation over fixed deadlines on BTC-USD limit-order books (LOB). We present RL-Exec, a PPO agent trained on historical replays augmented with endogenous transient impact (resilience), partial fills, maker/taker fees, and latency. The policy observes depth-20 LOB features plus microstructure indicators and acts under a sell-only inventory constraint to reach a residual target. Evaluation follows a strict time split (train: Jan-2020; test: Feb-2020) and a per-day protocol: for each test day we run ten independent start times and aggregate them into a single daily score, avoiding pseudo-replication. We compare the agent to (i) TWAP and (ii) a VWAP-like baseline that allocates according to opposite-side order-book liquidity (top-20 levels), both executed on identical timestamps and with identical costs. Statistical inference uses one-sided Wilcoxon signed-rank tests on the daily RL-minus-baseline differences, with Benjamini-Hochberg FDR correction and bootstrap confidence intervals. On the Feb-2020 test set, RL-Exec significantly outperforms both baselines, and the gap increases with the execution horizon (+2-3 bps at 30 min, +7-8 bps at 60 min, +23 bps at 120 min). Code: github.com/Giafferri/RL-Exec
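The per-day inference protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`daily_scores`, `wilcoxon_greater`, `benjamini_hochberg`, `bootstrap_ci`) are hypothetical, the daily score is assumed to be the mean of that day's runs, the bootstrap is assumed to be a percentile bootstrap of the mean daily gap, and the input numbers are synthetic placeholders rather than results from the paper. The sketch uses only the Python standard library; a library routine such as `scipy.stats.wilcoxon` would normally be used for the test itself.

```python
import random
from statistics import mean

def daily_scores(runs_by_day):
    """Collapse each day's runs into one daily mean (one observation per day)."""
    return {day: mean(vals) for day, vals in runs_by_day.items()}

def wilcoxon_greater(diffs):
    """Exact one-sided Wilcoxon signed-rank p-value for H1: differences tend to be > 0."""
    d = [x for x in diffs if x != 0]            # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n                            # doubled midranks keep ties integer-valued
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # 2 * average of ranks i+1..j+1
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks2, d) if x > 0)
    # Count sign assignments by their (doubled) positive-rank sum via subset-sum DP.
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus:]) / 2 ** n        # P(W+ >= observed) under H0

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean daily difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Synthetic placeholder gaps in bps (NOT results from the paper):
# RL minus baseline for 10 runs on each of 29 days, per horizon in seconds.
rng = random.Random(1)
gaps_by_horizon = {
    h: {day: [rng.gauss(2.0, 5.0) for _ in range(10)] for day in range(29)}
    for h in (1800, 3600, 7200)
}
pvals, cis = [], []
for h, runs in gaps_by_horizon.items():
    daily = list(daily_scores(runs).values())   # one paired difference per day
    pvals.append(wilcoxon_greater(daily))
    cis.append(bootstrap_ci(daily))
adjusted = benjamini_hochberg(pvals)            # FDR control across the horizons
```

Collapsing the ten runs to one value per day before testing is what keeps the sample size equal to the number of test days rather than the number of runs, which is the pseudo-replication the protocol is designed to avoid.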

Paper Structure

This paper contains 27 sections, 10 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Indicators Correlation Heatmap.
  • Figure 2: RL $-$ TWAP daily gaps (bps).
  • Figure 3: RL $-$ VWAP-like daily gaps (bps).
  • Figure 4: Cumulative RL $-$ TWAP (bps).
  • Figure 5: Cumulative RL $-$ VWAP-like (bps).
  • ...and 5 more figures