Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning

Manan Tayal; Mumuksh Tayal

TL;DR

This work tackles offline reinforcement learning under hard safety constraints by casting it as a state-constrained optimal control problem and introducing Epigraph-Guided Flow Matching (EpiFlow). EpiFlow learns a data-driven auxiliary epigraph value function $\hat{V}(x,z)$ via expectile regression and a set of envelope functions, which yields a feasibility-aware policy that maximizes performance while staying within safe, data-supported regions. Policy synthesis combines an epigraph-guided objective with flow matching: a deterministic vector field defines a sampler for $\pi^*(a \mid x) \propto \pi_\beta(a \mid x) \exp(\alpha\hat{A}(x,a;z^*(x)))$, i.e., the behavior distribution $\pi_\beta$ reweighted according to the epigraph value function, and each action is generated by a single ODE integration. In experiments, EpiFlow achieves near-zero safety violations on low- and high-dimensional safety benchmarks (e.g., Safety-Gymnasium) while delivering competitive returns, outperforming several soft-constraint baselines. The approach offers a scalable, distribution-consistent safety certificate for deploying autonomous systems learned purely from offline data, with avenues for formal verification and robustness extensions.
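To make the expectile-regression step concrete, here is a minimal NumPy sketch of the generic asymmetric squared loss that expectile regression minimizes. The function name `expectile_loss`, the default `tau=0.9`, and the sign convention `u = target - pred` are assumptions chosen for illustration; the summary does not state which expectile level or convention EpiFlow uses.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Generic expectile loss: mean of |tau - 1[u < 0]| * u**2, with u = target - pred.

    tau and the sign convention are illustrative assumptions, not values from the paper.
    """
    u = np.asarray(target, dtype=float) - np.asarray(pred, dtype=float)
    # Residuals with u >= 0 (prediction too low) get weight tau, the rest 1 - tau.
    weight = np.where(u < 0.0, 1.0 - tau, tau)
    return float(np.mean(weight * u ** 2))

# Fitting a single constant shows the asymmetry: for tau > 0.5 the minimizer
# sits above the sample mean, closer to the larger targets.
samples = np.array([0.0, 0.0, 0.0, 1.0])
grid = np.linspace(0.0, 1.0, 101)
best = grid[np.argmin([expectile_loss(m, samples, tau=0.9) for m in grid])]
```

With `tau = 0.5` the loss reduces to half the mean squared error. Larger `tau` pushes the fit toward the upper range of the targets, which is how expectile regression can approximate a maximum over actions that appear in the dataset without querying unseen ones.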

Abstract

Offline reinforcement learning (RL) provides a compelling paradigm for training autonomous systems without the risks of online exploration, particularly in safety-critical domains. However, jointly achieving strong safety and performance from fixed datasets remains challenging. Existing safe offline RL methods often rely on soft constraints that allow violations, introduce excessive conservatism, or struggle to balance safety, reward optimization, and adherence to the data distribution. To address this, we propose Epigraph-Guided Flow Matching (EpiFlow), a framework that formulates safe offline RL as a state-constrained optimal control problem to co-optimize safety and performance. We learn a feasibility value function derived from an epigraph reformulation of the optimal control problem, thereby avoiding the decoupled objectives or post-hoc filtering common in prior work. Policies are synthesized by reweighting the behavior distribution based on this epigraph value function and fitting a generative policy via flow matching, enabling efficient, distribution-consistent sampling. Across various safety-critical tasks, including Safety-Gymnasium benchmarks, EpiFlow achieves competitive returns with near-zero empirical safety violations, demonstrating the effectiveness of epigraph-guided policy synthesis.
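The reweight-then-fit idea can be sketched as a weighted conditional flow matching loss followed by Euler integration of the learned vector field. This is a sketch under stated assumptions, not the authors' implementation: the straight-line noise-to-action path, the Gaussian source distribution, the weight clipping, and every name here (`exp_advantage_weights`, `weighted_fm_loss`, `sample_action`, `velocity_fn`, `max_weight`) are hypothetical choices made for illustration.

```python
import numpy as np

def exp_advantage_weights(advantages, alpha=1.0, max_weight=100.0):
    """Unnormalized weights exp(alpha * A); the cap max_weight is an assumed stabilizer."""
    scaled = alpha * np.asarray(advantages, dtype=float)
    # Clip the exponent so the weights never exceed max_weight or overflow.
    return np.exp(np.minimum(scaled, np.log(max_weight)))

def weighted_fm_loss(velocity_fn, states, actions, weights, rng):
    """Weighted conditional flow matching loss on a straight noise-to-action path."""
    noise = rng.standard_normal(actions.shape)      # a_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))     # t ~ U[0, 1)
    a_t = (1.0 - t) * noise + t * actions           # point on the straight path
    target = actions - noise                        # velocity of that path
    err = np.sum((velocity_fn(states, a_t, t) - target) ** 2, axis=1)
    # Self-normalized weighted mean of the per-sample regression error.
    return float(np.sum(weights * err) / np.sum(weights))

def sample_action(velocity_fn, state, action_dim, rng, n_steps=10):
    """Draw one action by Euler-integrating da/dt = v(x, a, t) from t = 0 to t = 1."""
    a = rng.standard_normal((1, action_dim))
    x = np.asarray(state, dtype=float)[None, :]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((1, 1), k * dt)
        a = a + dt * velocity_fn(x, a, t)
    return a[0]
```

Here `velocity_fn(states, actions, t)` stands in for a trainable network. In practice the loss would be minimized with an autodiff framework, and the Euler loop could be replaced by any ODE solver.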

Paper Structure (43 sections, 4 theorems, 59 equations, 8 figures, 4 tables, 2 algorithms)

Introduction
Related Works
Safe Offline Reinforcement Learning.
Safe Generative Policies.
Background and Problem Setup
State-Constrained Offline Reinforcement Learning
Epigraph Reformulation
Learning the Epigraph Value Function
Recursive Characterization
Avoiding OOD actions.
Practical Loss Functions
Policy Synthesis via Epigraph-Guided Flow Matching
Experiments
Experimental Case Studies
Results
...and 28 more sections

Key Result

Theorem 4.1

Consider a deterministic SC-MDP with bounded reward and discount factor $\gamma \in (0,1)$. Given a transition $(x_t,a_t,r_t,\ell_t,x_{t+1})$ and the epigraph update $z_{t+1}=(z_t-r(x_t))/\gamma$, the epigraph value function $\hat{V}(x_t, z_t)$ satisfies the recursion
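As a small illustration of the epigraph update quoted in Theorem 4.1 (only the update of $z$, not the value recursion itself), the sketch below rolls $z_{t+1} = (z_t - r_t)/\gamma$ forward along a recorded reward sequence. The function name `rollout_epigraph_variable` and the example numbers are hypothetical. Unrolling the update gives $\gamma^t z_t = z_0 - \sum_{k<t} \gamma^k r_k$, so $z_t$ can be read as the rescaled part of an initial return target $z_0$ that has not yet been collected.

```python
def rollout_epigraph_variable(z0, rewards, gamma):
    """Return [z_0, z_1, ..., z_T] using z_{t+1} = (z_t - r_t) / gamma."""
    zs = [float(z0)]
    for r in rewards:
        zs.append((zs[-1] - r) / gamma)
    return zs

# Illustrative numbers: three unit rewards with gamma = 0.5 have a discounted
# return of 1 + 0.5 + 0.25 = 1.75. Starting from z_0 = 1.75, the remaining
# target reaches exactly zero once all rewards have been collected.
zs = rollout_epigraph_variable(1.75, [1.0, 1.0, 1.0], gamma=0.5)
# zs == [1.75, 1.5, 1.0, 0.0]
```

If $z_0$ is set above the achieved discounted return, the final $z_T$ is positive. If it is set below, the final $z_T$ is negative.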

Figures (8)

Figure 1: EpiFlow framework. We solve the SC-ORL problem by learning an auxiliary value function, $\hat{V}(x,z)$, from an offline dataset. The problem is cast in epigraph form to learn $\hat{V}$; keeping $\hat{V}$ positive ensures that the system stays safe while maximizing the objective. Finally, we learn a policy that maximizes $\hat{V}$ at every time step to find the best path that the offline data supports.
Figure 2: Evaluation results for all environments. Points further left ($\leftarrow$) are safer than those further right ($\rightarrow$), and points higher up ($\uparrow$) have higher rewards than those lower down ($\downarrow$). Evaluated over 500 episodes and 5 seeds.
Figure 3: Boat environment analysis. (Top) Trajectories of all methods (baselines and ours) rolled out from 2 distinct initial states. The 2 circles are obstacles that the agent must avoid while trying to reach the goal.
Figure 4: Auxiliary epigraph value function $\hat{V}_{\theta}(x,z)$. (Top) Learned without the decomposition regularizer, where $\hat{V}_{\theta}$ exhibits weak sensitivity to the epigraph variable $z$ and similar contour structure across different $z$ values. (Bottom) Learned with the regularizer ($\lambda=0.25$), which restores meaningful dependence on $z$ and yields distinct level sets consistent with the epigraph formulation.
Figure 5: Illustration of all evaluation environments. (Top-Left) Boat Navigation with obstacles and a goal point. (Middle) SafeVelocity MuJoCo environments, with the red sphere denoting an unsafe state. (Bottom) SafeCarNavigation MuJoCo environments with various obstacles and objectives.
...and 3 more figures

Theorems & Definitions (8)

Theorem 4.1: Epigraph Value Recursion
Theorem 5.1: Epigraph-Guided Optimal Policy
Theorem 5.2: Weighted Flow Matching Policy Recovery
Proposition 1.1: Equivalence of Epigraph Formulation
proof
proof
proof
proof
