Reinforcement Learning in High-frequency Market Making

Yuheng Zheng; Zihan Ding

TL;DR

This work develops a theoretical framework for reinforcement learning in high-frequency market making by discretizing a continuous-time market-making model into a family of MDPs indexed by the sampling interval $\Delta$. It reveals a fundamental tradeoff: a smaller $\Delta$ reduces learning error but increases algorithmic complexity (and the implied costs). It also proves that optimal policies and Nash equilibria converge as $\Delta\rightarrow0$, with an $O(\Delta)$ bound on the value-function error in the single-agent case and analogous results in the two-player setting under a uniqueness assumption. The paper also provides a model-free Nash Q-learning algorithm to compute equilibrium policies and presents Monte Carlo simulations that validate the theoretical results and illustrate the practical implications for selecting the sampling frequency. The findings extend to general discretized continuous-time MDPs and may apply to other high-frequency decision problems such as optimal execution, offering guidance for practitioners on frequency choice and a foundation for future deep RL extensions in finance.

Abstract

This paper establishes a new and comprehensive theoretical analysis of reinforcement learning (RL) applied to high-frequency market making. We bridge modern RL theory and the continuous-time statistical models of high-frequency financial economics. Unlike most of the existing literature, which focuses on developing RL methods for the market-making problem, our work is a pilot study that provides the theoretical analysis. We target the effects of sampling frequency and find a tradeoff between the error and the complexity of the RL algorithm when varying the time increment $\Delta$: as $\Delta$ becomes smaller, the error decreases but the complexity increases. We also study the two-player case under the general-sum game framework and establish the convergence of the Nash equilibrium to the continuous-time game equilibrium as $\Delta\rightarrow0$. The Nash Q-learning algorithm, an online multi-agent RL method, is applied to solve for the equilibrium. Our theory is not only useful for practitioners choosing the sampling frequency, but also general and applicable to other high-frequency financial decision-making problems, e.g., optimal execution, as long as a time-discretization of a continuous-time Markov decision process is adopted. Monte Carlo simulation evidence supports all of our theoretical results.
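
The tradeoff described in the abstract can be illustrated with a minimal numerical sketch. This is not the paper's market-making model: the two-state chain, the generator `Q`, the reward rate `r`, the discount rate `gamma`, the discounted infinite-horizon objective, and the first-order kernel `I + Q*delta` are all assumptions chosen for illustration. The sketch compares the continuous-time value of a fixed policy with the value of its discretization for several sampling intervals, and reports the effective horizon `1/(1 - exp(-gamma*delta))` as a rough, assumed proxy for how hard the discrete problem is to learn.

```python
import numpy as np

# Hypothetical toy chain (NOT the paper's market-making model):
# two states under a fixed policy, generator Q (rows sum to zero),
# reward rate r per unit time, continuous discount rate gamma.
Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])
r = np.array([1.0, 0.0])
gamma = 0.5
I = np.eye(2)

# Continuous-time discounted value: V_0 = (gamma*I - Q)^{-1} r.
V0 = np.linalg.solve(gamma * I - Q, r)


def discretized_value(delta):
    """Value of the chain when it is sampled every `delta` time units.

    Assumes the first-order kernel P = I + Q*delta and the per-step
    discount exp(-gamma*delta); V solves V = r*delta + beta * P @ V.
    """
    P = I + Q * delta              # a valid kernel here while delta <= 0.5
    beta = np.exp(-gamma * delta)  # per-step discount factor
    return np.linalg.solve(I - beta * P, r * delta)


deltas = [0.1, 0.01, 0.001]
# Worst-case (max over states) gap to the continuous-time value.
errors = [float(np.max(np.abs(discretized_value(d) - V0))) for d in deltas]
# Effective horizon in steps: an assumed proxy for learning difficulty.
horizons = [float(1.0 / (1.0 - np.exp(-gamma * d))) for d in deltas]

for d, e, h in zip(deltas, errors, horizons):
    print(f"delta={d:<6} max error={e:.5f}  effective horizon={h:.0f} steps")
```

In this toy setting the discretization error shrinks roughly in proportion to `delta`, while the effective horizon grows roughly like `1/(gamma*delta)`, which is the kind of opposing trend the abstract describes.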

Paper Structure (26 sections, 6 theorems, 83 equations, 5 figures, 1 algorithm)

Introduction
High-frequency Market Making
The state space $\mathcal{S}:=\mathcal{S}_{X}\times\mathcal{S}_{Y}$ and the state variable $S_{t}:=(X_{t},Y_{t})$.
The action space $\mathcal{A}:=\mathcal{S}_{P}\times\mathcal{S}_{P}$ and the action variable $a_{t}:=(p_{t}^{a},p_{t}^{b})$.
The transition probability kernel $\mathcal{P}$, the reward function $R$, and the value function $V_{0}^{\pi}(s)$.
Time-discretization and Convergence
Discrete-time model
Convergence of discrete-time MDP $\mathcal{M}_{\Delta}$ as $\Delta\rightarrow0$
Sample Complexity
Q-learning for single-player case
Tradeoff between learning error and sample complexity
Two-player General-sum Setting
Continuous-time model and Nash equilibrium
Time-discretization and convergence of equilibrium
Nash Q-learning algorithm for solving the equilibrium
...and 11 more sections

Key Result

Theorem 1

Under Assumption Assump_Q_rate_lambda, there exist stationary Markov policies $\pi_{0}^{*}(\cdot)$ and $\pi_{\Delta}^{*}(\cdot)$ such that the optimal value functions in the continuous-time MDP $\mathcal{M}_{0}$ and the discrete-time MDP $\mathcal{M}_{\Delta}$ are attained under $\pi_{0}^{*}(\cdot)$ and $\pi_{\Delta}^{*}(\cdot)$, respectively. Moreover, assuming the uniqueness of the optimal policies in $\mathcal{M}_{0}$ and $\mathcal{M}_{\Delta}$, ...

Figures (5)

Figure 1: The tradeoff between learning error and sample complexity
Figure 2: The convergence of optimal value function $V_{\Delta}^{*}(s)$ against $\Delta$
Figure 3: The trend of sample complexity against $\Delta$
Figure 4: The convergence of equilibrium value function $V_{\Delta}^{k,\pi_{\Delta}^{1,*},\pi_{\Delta}^{2,*}}(s)$ against $\Delta$
Figure 5: $\text{MM}_{1}$ (left) and $\text{MM}_{2}$ (right) equilibrium value function learning error

Theorems & Definitions (12)

Theorem 1
Theorem 2
Definition 3
Definition 4
Theorem 5
Lemma 6
proof
Lemma 7
proof
Lemma 8
...and 2 more
