Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies

Mumuksh Tayal; Manan Tayal; Ravi Prakash

Abstract

Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends Flow Q-Learning (FQL) to safe offline RL by combining a Hamilton-Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor that selects reward-maximizing safe actions without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
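The idea of a reachability-style safety value can be illustrated on a toy problem. The sketch below is not the paper's recursion, which this excerpt does not reproduce. It assumes the common undiscounted backup V(s) = max(h(s), min_a V(s')), with the convention that h(s) > 0 marks a constraint violation and V(s) <= 0 marks a state from which violation can be avoided. The chain, the drift at state 3, and all names (`step`, `safety_value`, `h`) are hypothetical.

```python
# Toy 5-state chain; dynamics and margins are illustrative assumptions,
# not taken from the paper.
N_STATES = 5
ACTIONS = (-1, +1)
# Safety margin h(s): positive means the state violates the constraint.
h = [-1.0, -1.0, -1.0, -1.0, 1.0]

def step(s, a):
    """Deterministic transition with a drift region at state 3."""
    if s >= 3:
        # State 3 drifts into the unsafe state 4; state 4 is absorbing.
        return 4
    return min(max(s + a, 0), N_STATES - 1)

def safety_value(h, n_iters=50):
    """Fixed-point iteration of V(s) = max(h(s), min_a V(step(s, a)))."""
    V = list(h)
    for _ in range(n_iters):
        V_new = [
            max(h[s], min(V[step(s, a)] for a in ACTIONS))
            for s in range(N_STATES)
        ]
        if V_new == V:  # converged
            break
        V = V_new
    return V

V = safety_value(h)
# States with V <= 0 can avoid violation forever under some action choice.
feasible = [s for s in range(N_STATES) if V[s] <= 0.0]
print(V, feasible)
```

State 3 does not violate the constraint itself (h(3) < 0), yet its value is positive because every action leads into the unsafe state. This is the distinction between "currently safe" and "feasible" that a reachability-based critic is meant to capture and that an expected-cost objective can blur.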

Paper Structure (33 sections, 1 theorem, 32 equations, 6 figures, 4 tables, 2 algorithms)

Introduction
Background and Problem Setup
Generative Policies for Offline RL
Safe Offline Reinforcement Learning
Safe Flow Q-Learning
Learning Reward and Safety Critics
Reward critics.
Safety critics.
Behavior Flow Policy and One-Step Distillation
Flow behavior teacher.
One-step student actor.
Feasibility-Gated Actor Objective
Limitations of the naive Lagrangian formulation.
Feasibility-gated objective.
In-sample action generation.
...and 18 more sections

Key Result

Lemma 1

Consider a set of independent and identically distributed (i.i.d.) calibration data $\{(X_i, Y_i)\}_{i=1}^n$ and a new test point $(X_{\text{test}}, Y_{\text{test}})$ sampled independently from the same distribution. Define a score function $s(x, y) \in \mathbb{R}$, where higher [...] Assuming exchangeability, the prediction set $\mathcal{C}(X_{\text{test}})$ guarantees the marginal [...]
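The lemma statement in this excerpt is cut off. For orientation, the standard split conformal construction from the cited tutorial takes $\hat{q}$ to be the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score, sets $\mathcal{C}(x) = \{y : s(x, y) \le \hat{q}\}$, and obtains $\mathbb{P}(Y_{\text{test}} \in \mathcal{C}(X_{\text{test}})) \ge 1 - \alpha$. The sketch below assumes the paper follows that construction; the function name `conformal_quantile`, the miscoverage level `alpha`, and the toy scores are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the split conformal threshold q_hat for miscoverage level alpha.

    q_hat is the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If that rank exceeds n, no finite threshold gives the guarantee, so the
    function returns infinity (the prediction set is then everything).
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(scores)[rank - 1]

# Toy calibration scores (illustrative values only).
calibration_scores = [3.0, 1.0, 2.0, 7.0, 5.0, 4.0, 6.0]
q_hat = conformal_quantile(calibration_scores, alpha=0.25)

def in_prediction_set(score, q_hat):
    """A candidate y belongs to C(x) when its score does not exceed q_hat."""
    return score <= q_hat

print(q_hat, in_prediction_set(5.5, q_hat), in_prediction_set(6.5, q_hat))
```

The `(n + 1)` in the rank, rather than `n`, is the finite-sample correction that makes the coverage bound hold for any calibration set size under exchangeability.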

Figures (6)

Figure 1: Framework Overview. The SafeFQL framework is a safe offline RL approach built on efficient one-step flow policy extraction.
Figure 2: Evaluation Results. SafeFQL achieves the lowest cost in every evaluated environment and the highest reward among frameworks with comparable cost. Baselines tagged (R.S.) are evaluated with rejection sampling (N=16).
Figure 3: Action Sampling Efficiency. Generative policy-based methods (FISOR, SafeIFQL) require rejection sampling to reach high safety rates. SafeFQL achieves the highest safety in the Safe Boat Navigation environment with only N=1 action sample, while the other baselines require larger N.
Figure 4: Computation Time Analysis. Training time (left) and inference time (right) for each of the three generative policy-based frameworks.
Figure 5: Illustration of Evaluation Environments. (Top left): the Safe Boat Navigation task, with 2 obstacles and a goal point in a drifting river. (Remaining): standard Safety Gymnasium environments [ji2023safety] from its Safe Velocity suite.
...and 1 more figures
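The captions for Figures 2 and 3 refer to rejection sampling with N candidate actions at evaluation time. As a rough illustration of that general idea, and not of any baseline's actual implementation, the sketch below filters N candidates with a safety critic and picks the highest-reward survivor. The critics, the threshold convention (safety value <= threshold means safe), and the fallback rule are all assumptions.

```python
def select_action(candidates, reward_q, safety_q, threshold=0.0):
    """Rejection-sampling style selection over a list of candidate actions.

    Keeps candidates whose safety value is within the threshold and returns
    the one with the highest reward value. If none pass, falls back to the
    candidate with the lowest safety value (an assumed fallback rule).
    """
    safe = [a for a in candidates if safety_q(a) <= threshold]
    if safe:
        return max(safe, key=reward_q)
    return min(candidates, key=safety_q)

# Hypothetical 1-D critics: reward grows with the action, and actions
# above 0.5 are deemed unsafe.
reward_q = lambda a: a
safety_q = lambda a: a - 0.5

chosen = select_action([-0.8, -0.2, 0.3, 0.9], reward_q, safety_q)
fallback = select_action([0.7, 0.9], reward_q, safety_q)
print(chosen, fallback)
```

Each call evaluates both critics on all N candidates, so inference cost grows with N. A single-sample actor (N=1), as reported for SafeFQL in Figure 3, avoids that loop.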

Theorems & Definitions (2)

proof
Lemma 1: Split Conformal Prediction [angelopoulos2022gentle]
