Safe and Compliant Cross-Market Trade Execution via Constrained RL and Zero-Knowledge Audits

Ailiya Borjigin, Cong He

TL;DR

This work presents a safe reinforcement learning (RL) framework for cross-market execution that enforces hard constraints via a runtime Shield and provides verifiable compliance through a zero-knowledge compliance audit (zkCA). Formulated as a constrained Markov decision process (CMDP), the agent optimizes execution quality while respecting volume participation limits, price collars, and self-trading restrictions. Key contributions include a PPO-based Execution Agent, a real-time action-projection Shield, and zkSNARK-based proofs of compliance, evaluated in a two-venue ABIDES simulation that shows competitive implementation shortfall (IS) and zero violations under stress. The results suggest that compliance by design can coexist with strong performance and market stability, offering a practical pathway to trustworthy AI in regulated trading contexts.

Abstract

We present a cross-market algorithmic trading system that balances execution quality with rigorous compliance enforcement. The architecture comprises a high-level planner, a reinforcement learning execution agent, and an independent compliance agent. We formulate trade execution as a constrained Markov decision process with hard constraints on participation limits, price bands, and self-trading avoidance. The execution agent is trained with proximal policy optimization, while a runtime action-shield projects any unsafe action into a feasible set. To support auditability without exposing proprietary signals, we add a zero-knowledge compliance audit layer that produces cryptographic proofs that all actions satisfied the constraints. We evaluate in a multi-venue, ABIDES-based simulator and compare against standard baselines (e.g., TWAP, VWAP). The learned policy reduces implementation shortfall and variance while exhibiting no observed constraint violations across stress scenarios including elevated latency, partial fills, compliance module toggling, and varying constraint limits. We report effects at the 95% confidence level using paired t-tests and examine tail risk via CVaR. We situate the work at the intersection of optimal execution, safe reinforcement learning, regulatory technology, and verifiable AI, and discuss ethical considerations, limitations (e.g., modeling assumptions and computational overhead), and paths to real-world deployment.
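The runtime shield described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Order` and `Limits` types, the `shield` function, the tick size, and the specific projection rules (clip quantity to the remaining participation budget, clamp price to a collar around a reference price, step the price away from the agent's own resting orders) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Order:
    side: str      # "buy" or "sell"
    qty: int
    price: float


@dataclass
class Limits:           # hypothetical container for the three hard constraints
    alpha: float        # maximum share of market volume the agent may trade
    collar: float       # maximum fractional deviation from the reference price
    tick: float         # minimum price increment (assumed)


def shield(order: Order, market_volume: int, executed: int, ref_price: float,
           own_resting: List[Order], limits: Limits) -> Optional[Order]:
    """Project a proposed order into the feasible set; None means 'do not trade'."""
    # 1. Participation limit: cumulative executed quantity <= alpha * market volume.
    budget = max(0, int(limits.alpha * market_volume) - executed)
    qty = min(order.qty, budget)
    if qty <= 0:
        return None

    # 2. Price collar: clamp the limit price into [lo, hi] around the reference.
    lo = ref_price * (1 - limits.collar)
    hi = ref_price * (1 + limits.collar)
    price = min(max(order.price, lo), hi)

    # 3. Self-trade avoidance: never cross the agent's own resting orders.
    if order.side == "buy":
        own_asks = [o.price for o in own_resting if o.side == "sell"]
        if own_asks and price >= min(own_asks):
            price = min(own_asks) - limits.tick
            if price < lo:          # no feasible price remains inside the collar
                return None
    else:
        own_bids = [o.price for o in own_resting if o.side == "buy"]
        if own_bids and price <= max(own_bids):
            price = max(own_bids) + limits.tick
            if price > hi:
                return None

    return Order(order.side, qty, price)
```

In this sketch an already-compliant order passes through unchanged, so the shield only intervenes when the policy proposes something outside the feasible set.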

Paper Structure

This paper contains 14 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: System architecture. The Planner splits the order and sets high-level parameters. The RL Execution Agent decides specific order placements, which the Shield filters for compliance before they reach the Market Environment (multiple exchanges and trading agents). The Compliance Agent (through the Shield and monitoring) ensures that no rule is violated in real time and generates zero-knowledge compliance proofs (zkCA) after execution for auditability.
  • Figure 2: Impact of network latency on execution shortfall (bps). RL (PPO) degrades gracefully as latency increases and remains competitive with VWAP/TWAP, while the arbitrage baseline degrades sharply.
  • Figure 3: Impact of volume participation limit $\alpha$ on performance. Left: The Safe RL agent’s implementation shortfall (IS) improves (becomes less negative) as $\alpha$ increases from 5% to 50%, but with diminishing returns beyond 20–30%. Right: Conceptual illustration of an unconstrained agent’s tendency to violate a given cap. At very low caps, an RL agent without enforcement would exceed the limit in most episodes; as the cap loosens, the agent naturally stays within bounds.
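The shortfall and tail-risk metrics referred to in the captions and abstract can be sketched as follows. The sign convention is an assumption inferred from the Figure 3 caption ("less negative" is better, so a cost shows up as a negative value), and the function names, the `(quantity, price)` fill format, and the empirical CVaR estimator are illustrative choices rather than the paper's definitions.

```python
import math
from typing import List, Tuple


def implementation_shortfall_bps(fills: List[Tuple[int, float]],
                                 arrival_price: float,
                                 side: str = "buy") -> float:
    """IS in basis points relative to the arrival price.

    Assumed convention: negative values are a cost (a buy filled above
    arrival, or a sell filled below it).
    """
    total_qty = sum(q for q, _ in fills)
    if total_qty == 0:
        raise ValueError("no fills")
    avg_price = sum(q * p for q, p in fills) / total_qty
    sign = 1.0 if side == "buy" else -1.0
    return sign * (arrival_price - avg_price) / arrival_price * 1e4


def cvar(values: List[float], level: float = 0.95) -> float:
    """Empirical CVaR: mean of the worst (1 - level) fraction of outcomes.

    'Worst' means lowest here, matching the assumed sign convention above.
    """
    if not values:
        raise ValueError("no values")
    # round() guards against floating-point noise such as 1.0000000000000009
    k = max(1, math.ceil(round((1 - level) * len(values), 9)))
    worst = sorted(values)[:k]
    return sum(worst) / k
```

For example, a buy order filled at an average of 100.2 against an arrival price of 100.0 yields roughly -20 bps under this convention, and `cvar` applied to per-episode IS values averages the most negative episodes.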