Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

Jian'an Zhang

TL;DR

Tail-Safe introduces a deployable hedging framework that unifies distributional, risk-sensitive reinforcement learning with a white-box, finance-specific CBF–QP safety layer. By employing an IQN-based CVaR objective and a Tail-Coverage Controller, the method stabilizes left-tail estimation while guiding policy updates with KL-regularized PPO to guard against distribution shift. The safety layer enforces discrete-time CBF constraints, ellipsoidal no-trade bands, box/rate limits, and a sign-consistency gate, with solver telemetry enabling auditable governance. Theoretical results establish robust forward invariance, minimal-deviation projection, and CVaR/DRO connections, while experiments in arbitrage-free synthetic markets show improved tail risk without sacrificing central performance, along with rich safety telemetry for governance.

Abstract

We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN--CVaR--PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$α$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints -- ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate -- solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.

Paper Structure

This paper contains 135 sections, 16 theorems, 127 equations, 5 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Under Assumptions assump:dynamics through assump:cbf (the paper's standing assumptions on the dynamics and the CBF condition), if $h_i(x_t)\ge 0$ for all $i$, then $h_i(x_{t+1})\ge 0$ for all $i$. Hence the safe set $\mathcal{C}:=\{x:\,h_i(x)\ge 0,\,\forall i\}$ is forward invariant.
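
The invariance statement can be illustrated on a toy problem. The sketch below is not the paper's safety layer: it assumes a scalar position with dynamics $x_{t+1}=x_t+u_t$, barriers $h_1(x)=x_{\max}-x$ and $h_2(x)=x+x_{\max}$, the commonly used discrete-time CBF condition $h(x_{t+1})\ge(1-\kappa)\,h(x_t)$ with $\kappa\in(0,1]$, and a symmetric rate limit $|u|\le u_{\max}$. In one dimension the minimal-deviation QP reduces to clipping the nominal action to an interval. The names `safe_action`, `rollout`, `x_max`, `u_max`, and `kappa` are hypothetical.

```python
import numpy as np

def safe_action(x, u_nom, x_max=1.0, u_max=0.5, kappa=0.5):
    """Closest action to u_nom satisfying the toy CBF and rate constraints.

    Assumed dynamics: x_next = x + u.
    h1(x) = x_max - x  gives  u <=  kappa * (x_max - x)
    h2(x) = x + x_max  gives  u >= -kappa * (x + x_max)
    In 1-D, min (u - u_nom)^2 over an interval is a clip.
    """
    lo = max(-kappa * (x + x_max), -u_max)
    hi = min(kappa * (x_max - x), u_max)
    return float(np.clip(u_nom, lo, hi))

def rollout(x0, nominal_actions, **kwargs):
    """Apply the filtered actions step by step and return the state path."""
    xs = [x0]
    for u_nom in nominal_actions:
        xs.append(xs[-1] + safe_action(xs[-1], u_nom, **kwargs))
    return np.array(xs)

# Deliberately aggressive nominal actions (std 2.0 against a bound of 1.0).
rng = np.random.default_rng(0)
xs = rollout(0.0, rng.normal(0.0, 2.0, size=200))

# Smallest barrier value seen along the path; non-negative means the
# state never left the safe set {x : |x| <= x_max}.
h_min = min((1.0 - xs).min(), (xs + 1.0).min())
print(f"min barrier value along the path: {h_min:.4f}")
```

Starting inside the safe set, the interval always contains $u=0$, so the toy problem is feasible at every step without slack. The disturbance-free dynamics here sidestep the bounded model mismatch that the theorem itself handles.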

Figures (5)

  • Figure 1: Tail-Safe overview. (a) Market & Execution: SSVI $\rightarrow$ Dupire $\rightarrow$ 30D VIX; ABIDES/MockLOB execution with temporary and transient impact. (b) IQN--CVaR--PPO: distributional critic via quantile regression, PPO with KL and entropy regularization. (c) White-box CBF--QP safety layer: discrete-time CBF, ellipsoidal NTB, box/rate limits, and a sign-consistency gate solved as a convex QP. (d) Telemetry & risk metrics: $\mathrm{VaR}/\mathrm{ES}$, policy KL, tail coverage, active-set, tightest-constraint ID, rate utilization, gate score, and slack.
  • Figure 2: PnL ECDF (overlay). Tail-Safe vs. QP-only baseline on matched scenarios and unified sample sizes. Shaded bands (if present) indicate $95\%$ paired bootstrap CIs at each quantile (Appendix B).
  • Figure 3: PnL distribution (overlay). Negative tails thin out under Tail-Safe, while the bulk remains comparable. Vertical lines annotate $\operatorname{VaR}_\alpha$ and $\operatorname{ES}_\alpha$.
  • Figure 4: QP-only Baseline. Unified samples; tail annotations correspond to $\alpha$ used at evaluation.
  • Figure 5: Tail-Safe. Left-tail improvement with central mass preserved.
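
The $\operatorname{VaR}_\alpha$ and $\operatorname{ES}_\alpha$ annotations in the figures can be read against a plain empirical estimator. The sketch below is a generic sample-based version, not the paper's temperature-tilted estimator. It assumes the left-tail convention on PnL, where $\operatorname{VaR}_\alpha$ is the empirical $\alpha$-quantile and $\operatorname{ES}_\alpha$ is the mean of the worst $\lceil \alpha n\rceil$ outcomes. The function name `var_es` and the sample data are hypothetical.

```python
import numpy as np

def var_es(pnl, alpha=0.05):
    """Empirical left-tail VaR and ES of a PnL sample.

    Assumed convention: losses are negative PnL, the tail is the
    worst ceil(alpha * n) outcomes, VaR is the largest value in
    that tail, and ES is the tail mean.
    """
    pnl = np.sort(np.asarray(pnl, dtype=float))
    k = max(1, int(np.ceil(alpha * pnl.size)))
    tail = pnl[:k]
    return float(tail[-1]), float(tail.mean())

# Illustrative numbers only, not results from the paper.
sample = [-10.0, -4.0, -2.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
var_20, es_20 = var_es(sample, alpha=0.2)
print(var_20, es_20)
```

For small $\alpha$ the tail contains few samples, so this estimator is noisy. That is the estimation problem the Tail-Coverage Controller is described as addressing.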

Theorems & Definitions (32)

  • Theorem 1: Robust forward invariance
  • Proposition 1: Shifted projection
  • Theorem 2: Donsker--Varadhan bound and per-state KL conservatism
  • Theorem 3: Concentration of the temperature-tilted CVaR estimator
  • Theorem 4: CVaR trust-region improvement
  • Theorem 5: Persistence via NTB shrinkage and rate tightening
  • Proposition 2: Gate-induced lower bound
  • Lemma 1: Lipschitz pushforward under bounded disturbance
  • proof
  • proof: Proof of Theorem 1 (robust forward invariance)
  • ...and 22 more