Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

Jian'an Zhang

Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

Jian'an Zhang

TL;DR

Tail-Safe introduces a deployable hedging framework that unifies distributional, risk-sensitive reinforcement learning with a white-box, finance-specific CBF–QP safety layer. By employing an IQN-based CVaR objective and a Tail-Coverage Controller, the method stabilizes left-tail estimation while guiding policy updates with KL-regularized PPO to guard against distribution shift. The safety layer enforces discrete-time CBF constraints, ellipsoidal no-trade bands, box/rate limits, and a sign-consistency gate, with solver telemetry enabling auditable governance. Theoretical results establish robust forward invariance, minimal-deviation projection, and CVaR/DRO connections, while experiments in arbitrage-free synthetic markets show improved tail risk without sacrificing central performance, along with rich safety telemetry for governance.

Abstract

We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN--CVaR--PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$α$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints -- ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate -- solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.

Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

TL;DR

Abstract

estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints -- ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate -- solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI

Dupire

VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.

Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

TL;DR

Abstract

Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF--QP Safety Layer in Arbitrage-Free Markets

TL;DR

Abstract

Paper Structure

Table of Contents

Key Result

Figures (5)

Theorems & Definitions (32)