Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

Mumuksh Tayal, Manan Tayal, Ravi Prakash

Abstract

Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends Flow Q-Learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.

Paper Structure (33 sections, 1 theorem, 32 equations, 6 figures, 4 tables, 2 algorithms)

Key Result

Lemma 1

Consider a set of independent and identically distributed (i.i.d.) calibration data, denoted as $\{(X_i, Y_i)\}_{i=1}^n$, along with a new test point $(X_{\text{test}}, Y_{\text{test}})$ sampled independently from the same distribution. Define a score function $s(x, y) \in \mathbb{R}$, where higher ... Assuming exchangeability, the prediction set $\mathcal{C}(X_{\text{test}})$ guarantees the marginal ...
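The lemma statement is truncated here, so the following is only a minimal sketch of the textbook split conformal procedure, not necessarily the exact construction used in the paper. It assumes that larger scores indicate worse conformity, a user-chosen miscoverage level `alpha`, and a prediction set of the form $\{y : s(x, y) \le \hat{q}\}$; the function names `conformal_quantile` and `in_prediction_set` are hypothetical.

```python
import math

import numpy as np


def conformal_quantile(scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of calibration scores.

    Assumption: this is the standard split conformal threshold, i.e. the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    sorted_scores = np.sort(np.asarray(scores, dtype=float))
    n = sorted_scores.size
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # (all-inclusive) prediction set achieves the coverage level.
        return math.inf
    return float(sorted_scores[rank - 1])


def in_prediction_set(score, q_hat):
    """A candidate y belongs to C(x) when its score s(x, y) is at most q_hat."""
    return score <= q_hat


# Usage: calibrate on held-out scores, then test a new score.
rng = np.random.default_rng(0)
calibration_scores = rng.normal(size=500)  # stand-in for s(X_i, Y_i)
q_hat = conformal_quantile(calibration_scores, alpha=0.1)
print(in_prediction_set(0.0, q_hat))
```

Under exchangeability of the calibration and test points, a threshold chosen this way gives marginal coverage of at least $1 - \alpha$ for the test point, which is the kind of finite-sample guarantee the abstract refers to when it describes adjusting the safety threshold.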

Figures (6)

  • Figure 1: Framework Overview. SafeFQL is a safe offline RL approach that uses efficient one-step flow policy extraction.
  • Figure 2: Evaluation Results. SafeFQL achieves the lowest costs across all evaluated environments and the highest reward among the frameworks with comparable costs. Baselines tagged (R.S.) are evaluated with rejection sampling (N=16) at evaluation time.
  • Figure 3: Action Sampling Efficiency. Generative-policy-based methods (FISOR, SafeIFQL) require rejection sampling to reach high safety rates; SafeFQL achieves the highest safety rate in the Safe Boat Navigation environment with only N=1 action sample, while other baselines require larger N.
  • Figure 4: Computation Time Analysis. Training time (left) and inference time (right) for each of the three generative-policy-based frameworks.
  • Figure 5: Illustration of Evaluation Environments. (Top left): The Safe Boat Navigation task, with 2 obstacles and a goal point in a drifting river. (Remaining): Standard Safety Gymnasium environments ji2023safety from the Safe Velocity suite.
  • ...and 1 more figures

Theorems & Definitions (2)

  • proof
  • Lemma 1: Split Conformal Prediction angelopoulos2022gentle