ABIDES-MARL: A Multi-Agent Reinforcement Learning Environment for Endogenous Price Formation and Execution in a Limit Order Book

Patrick Cheridito, Jean-Loup Dupret, Zhexin Wu

TL;DR

ABIDES-MARL introduces a MARL-enabled, end-to-end limit-order-book simulator that decouples kernel interruption from state collection, enabling synchronized learning among multiple adaptive agents. By embedding an informed trader, a liquidity trader, and competing market makers within a Kyle-style multi-period game, the framework shows how equilibrium-like price formation and endogenous liquidity emerge from strategic interaction. The approach is validated by recovering the gradual price discovery of the Kyle model and by analyzing execution against learned opponents. The study compares linear and nonlinear policy classes: linear policies support robust, gradual price discovery toward the fundamental value, while nonlinear policies can empower the informed trader and destabilize convergence, especially with richer market competition. The work provides a reproducible, extensible platform for analyzing equilibrium behavior in realistic markets and paves the way for integrating agentic AI with econometric market microstructure models.

Abstract

We present ABIDES-MARL, a framework that combines a new multi-agent reinforcement learning (MARL) methodology with a new realistic limit-order-book (LOB) simulation system to study equilibrium behavior in complex financial market games. The system extends ABIDES-Gym by decoupling state collection from kernel interruption, enabling synchronized learning and decision-making for multiple adaptive agents while maintaining compatibility with standard RL libraries. It preserves key market features such as price-time priority and discrete tick sizes. Methodologically, we use MARL to approximate equilibrium-like behavior in multi-period trading games with a finite number of heterogeneous agents (an informed trader, a liquidity trader, noise traders, and competing market makers), all with individual price impacts. This setting bridges optimal execution and market microstructure by embedding the liquidity trader's optimization problem within a strategic trading environment. We validate the approach by solving an extended Kyle model within the simulation system, recovering the gradual price discovery phenomenon. We then extend the analysis to a liquidity trader's problem where market liquidity arises endogenously and show that, at equilibrium, execution strategies shape market-maker behavior and price dynamics. ABIDES-MARL provides a reproducible foundation for analyzing equilibrium and strategic adaptation in realistic markets and contributes toward building economically interpretable agentic AI systems for finance.
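The abstract notes that the simulator preserves price-time priority and discrete tick sizes. As a rough illustration of what those two properties mean, here is a minimal sketch of the sell side of a toy order book. It is not ABIDES-MARL code: the class `ToyBook`, the helper `to_ticks`, the order identifiers, and the tick size `TICK = 0.01` are all hypothetical names and values chosen for the example.

```python
import heapq
from itertools import count

TICK = 0.01  # assumed tick size, for illustration only


def to_ticks(price: float) -> int:
    """Snap a price to the integer tick grid."""
    return round(price / TICK)


class ToyBook:
    """Sell side of a toy limit order book with price-time priority."""

    def __init__(self):
        # Heap entries: [price_in_ticks, arrival_seq, order_id, remaining_qty].
        # Sorting by (price, arrival_seq) is exactly price-time priority.
        self._asks = []
        self._seq = count()

    def add_ask(self, order_id: str, price: float, qty: int) -> None:
        entry = [to_ticks(price), next(self._seq), order_id, qty]
        heapq.heappush(self._asks, entry)

    def market_buy(self, qty: int):
        """Fill a market buy; return a list of (order_id, price_in_ticks, filled_qty)."""
        fills = []
        while qty > 0 and self._asks:
            best = self._asks[0]           # lowest price, earliest arrival
            take = min(qty, best[3])
            fills.append((best[2], best[0], take))
            best[3] -= take                # quantity is not part of the sort key
            qty -= take
            if best[3] == 0:
                heapq.heappop(self._asks)  # resting order fully filled
        return fills


book = ToyBook()
book.add_ask("mm_a", 100.01, 5)  # worse price, arrived first
book.add_ask("mm_b", 100.00, 3)  # best price, arrived second
book.add_ask("mm_c", 100.00, 4)  # best price, arrived third
fills = book.market_buy(6)
```

In this example the buy order is matched against `mm_b` first and `mm_c` second, because both quote the best price and `mm_b` arrived earlier; `mm_a` is untouched despite arriving first, since its price is one tick worse.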

Paper Structure

This paper contains 50 sections, 3 theorems, 46 equations, 6 figures, and 3 tables.

Key Result

Lemma 3.2

The reward function in Eqn. eq:seq-kyle-reward-mm implies zero aggregate profit across all market makers.

Figures (6)

  • Figure 1: ABIDES-MARL communication cycle.
  • Figure 2: Linear policy parameterization with 20 market makers and no limit-order-book (LOB) information revealed. Each panel shows the evolution of transaction prices when the opening price was initialized below (a) or above (b) the fundamental value. The fundamental value is marked by the red dashed line. Across evaluation episodes, prices gradually converged toward the fundamental value, indicating that even with limited observability, market participants collectively recovered informational efficiency.
  • Figure 3: Nonlinear policy parameterization with 20 market makers and no limit-order-book (LOB) information revealed. Panels (a) and (b) correspond to opening prices initialized below or above the fundamental value, respectively. The fundamental value is marked by the red dashed line. With the more flexible nonlinear policy network, the informed trader dominated the market dynamics, persistently driving transaction prices toward the lower boundary of the admissible price range regardless of the initial condition.
  • Figure 4: Price and unfilled-inventory dynamics for the liquidity trader trading against 20 market makers under linear policy parameterization, with no limit-order-book (LOB) information revealed. The liquidity trader’s risk aversion is set to $\phi = 0.01$. Panels (a) and (b) correspond to opening prices initialized below and above the fundamental value, respectively. In both cases, market makers rapidly adjust their quotes upward, limiting the trader’s ability to reduce transaction costs. This behavior reflects the market makers’ adaptation to the trader’s predictable acquisition strategy, resulting in consistently adverse price movements.
  • Figure 5: Nonlinear policy parameterization with 20 market makers and no limit-order-book (LOB) information revealed. The liquidity trader’s risk aversion is set to $\phi = 0.01$, and the opening price is initialized below the fundamental value. Each panel reports the price trajectory (left) and unfilled inventory process (right) averaged over 30 evaluation episodes. Among the five strategies, the PPO policy achieves the lowest implementation shortfall by waiting for prices to reach the floor level before initiating acquisition. In contrast, the Analytical strategy performs the worst, as it is the only one that induces an upward price movement during execution.
  • ...and 1 more figure
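The Figure 5 caption ranks strategies by implementation shortfall. The paper's exact objective, including how the risk-aversion parameter $\phi$ and unfilled inventory enter, is not reproduced in this summary, so the following is only a sketch of the common textbook notion: the average execution price paid relative to the arrival price, per share traded. The function name `implementation_shortfall` and the sample fills are assumptions made for illustration.

```python
def implementation_shortfall(fills, arrival_price, side="buy"):
    """Per-share cost of a list of (price, qty) fills relative to the arrival price.

    Positive values mean the execution was worse than trading everything
    at the arrival price. This is the generic textbook measure, not the
    paper's reward function.
    """
    total_qty = sum(qty for _, qty in fills)
    if total_qty == 0:
        raise ValueError("no shares were filled")
    sign = 1.0 if side == "buy" else -1.0
    cost = sum(sign * (price - arrival_price) * qty for price, qty in fills)
    return cost / total_qty


# A buyer who fills half at the arrival price and half one unit higher
# pays 0.5 per share more than the arrival-price benchmark.
example = implementation_shortfall([(100.0, 10), (101.0, 10)], arrival_price=100.0)
```

Under this measure, a strategy that pushes prices up while it is still acquiring shares is penalized on every later fill, which matches the caption's remark about the strategy that induces an upward price movement.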

Theorems & Definitions (8)

  • Definition 1: LOB
  • Claim 3.1: Unanimous VWAP
  • proof
  • Lemma 3.2: Zero Profit Among Market Makers
  • proof
  • Theorem 3.3
  • proof
  • Theorem 5.1: Recursive Linear Equilibrium of the Kyle Model
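Theorem 5.1 concerns the recursive, multi-period equilibrium, and its statement is not reproduced in this summary. As background only, the sketch below checks the well-known single-period Kyle (1985) case by Monte Carlo: the insider trades $x = \beta (v - p_0)$ with $\beta = \sigma_u / \sigma_v$, and the market maker prices the total order flow $y = x + u$ with impact $\lambda = \sigma_v / (2 \sigma_u)$. The function names, parameter values, and sample size are assumptions for illustration, and the setup is the textbook one rather than the paper's extended model with competing market makers.

```python
import random


def kyle_single_period(sigma_v: float, sigma_u: float):
    """Closed-form coefficients of the one-shot Kyle (1985) equilibrium."""
    beta = sigma_u / sigma_v         # insider intensity: x = beta * (v - p0)
    lam = sigma_v / (2.0 * sigma_u)  # price impact: p = p0 + lam * (x + u)
    return beta, lam


def simulate_slope(sigma_v, sigma_u, p0=100.0, n=100_000, seed=0):
    """Estimate the regression slope of (v - p0) on order flow y by simulation.

    If the insider follows the equilibrium strategy, this slope is the
    market maker's best linear pricing rule and should approach lambda.
    """
    rng = random.Random(seed)
    beta, _ = kyle_single_period(sigma_v, sigma_u)
    sxy = syy = 0.0
    for _ in range(n):
        v = rng.gauss(p0, sigma_v)    # fundamental value
        u = rng.gauss(0.0, sigma_u)   # noise-trader order flow
        y = beta * (v - p0) + u       # total order flow seen by the market maker
        sxy += (v - p0) * y
        syy += y * y
    return sxy / syy                  # both variables have zero mean


beta, lam = kyle_single_period(sigma_v=2.0, sigma_u=10.0)
estimated_lam = simulate_slope(sigma_v=2.0, sigma_u=10.0)
```

With these inputs the closed form gives $\beta = 5$ and $\lambda = 0.1$, and the simulated slope should land close to $\lambda$. In the single-period case only half of the prior variance of $v$ is revealed by the price, which is the one-shot analogue of the gradual price discovery discussed in the abstract.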