Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning

Manan Tayal, Mumuksh Tayal

TL;DR

This work tackles offline reinforcement learning under hard safety constraints by casting it as a state-constrained optimal control problem and introducing Epigraph-Guided Flow Matching (EpiFlow). It learns a data-driven auxiliary epigraph value function $\hat{V}(x,z)$ via expectile regression and a set of envelope functions, enabling a feasibility-aware policy that maximizes performance while staying within safe data-supported regions. Policy synthesis combines an epigraph-guided objective with Flow Matching, producing a deterministic vector-field-based sampler that implements $\pi^*(a|x) \propto \pi_\beta(a|x) \exp(\alpha\hat{A}(x,a;z^*(x)))$ and is executed through a single ODE integration. Empirically, EpiFlow achieves near-zero empirical safety violations across low- and high-dimensional safety benchmarks (e.g., Safety Gymnasium) while delivering competitive returns, outperforming several soft-constraint baselines. This approach offers a scalable, distribution-consistent safety certificate for deploying autonomous systems learned purely from offline data, with avenues for formal verification and robustness extensions.
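As a rough illustration of two ingredients named above, the following Python sketch shows an expectile regression loss and the exponential weights $\exp(\alpha\hat{A})$ used to reweight behavior samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the expectile level `tau`, and the weight clipping `max_weight` are hypothetical choices made for illustration.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.7):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 with u = target - pred.

    tau = 0.5 gives half the mean squared error; tau > 0.5 penalizes
    under-estimation (target above pred) more than over-estimation.
    """
    u = np.asarray(target, dtype=float) - np.asarray(pred, dtype=float)
    weight = np.where(u < 0, 1.0 - tau, tau)
    return float(np.mean(weight * u ** 2))

def exp_advantage_weights(advantages, alpha=1.0, max_weight=100.0):
    """Per-sample weights exp(alpha * A), clipped for numerical stability.

    Weighting samples drawn from the behavior policy by these values is one
    common way to target a policy proportional to pi_beta * exp(alpha * A).
    The clipping is an assumption of this sketch.
    """
    w = np.exp(alpha * np.asarray(advantages, dtype=float))
    return np.minimum(w, max_weight)

# Toy usage with made-up numbers.
loss = expectile_loss(pred=[0.0, 0.0], target=[2.0, -1.0], tau=0.7)
weights = exp_advantage_weights([-1.0, 0.0, 1.0], alpha=1.0)
print(loss, weights)
```

Larger values of `alpha` concentrate the weights on high-advantage actions, while smaller values keep the reweighted distribution closer to the behavior data.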

Abstract

Offline reinforcement learning (RL) provides a compelling paradigm for training autonomous systems without the risks of online exploration, particularly in safety-critical domains. However, jointly achieving strong safety and performance from fixed datasets remains challenging. Existing safe offline RL methods often rely on soft constraints that allow violations, introduce excessive conservatism, or struggle to balance safety, reward optimization, and adherence to the data distribution. To address this, we propose Epigraph-Guided Flow Matching (EpiFlow), a framework that formulates safe offline RL as a state-constrained optimal control problem to co-optimize safety and performance. We learn a feasibility value function derived from an epigraph reformulation of the optimal control problem, thereby avoiding the decoupled objectives or post-hoc filtering common in prior work. Policies are synthesized by reweighting the behavior distribution based on this epigraph value function and fitting a generative policy via flow matching, enabling efficient, distribution-consistent sampling. Across various safety-critical tasks, including Safety-Gymnasium benchmarks, EpiFlow achieves competitive returns with near-zero empirical safety violations, demonstrating the effectiveness of epigraph-guided policy synthesis.
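The policy-synthesis step described above (reweight behavior samples, then fit a generative policy via flow matching) can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: it assumes a straight-line interpolation path between a noise sample `a0` and a dataset action `a1`, omits conditioning on the state $x$, and uses fixed-step Euler integration for sampling. All function names are hypothetical.

```python
import numpy as np

def interpolate(a0, a1, t):
    """Point on the straight-line path from noise a0 (t=0) to action a1 (t=1)."""
    return (1.0 - t) * a0 + t * a1

def weighted_fm_loss(v_pred, a0, a1, weights):
    """Weighted flow-matching regression loss for the straight-line path.

    The regression target along that path is the constant velocity a1 - a0.
    `weights` are per-sample weights, e.g. derived from a value function.
    """
    target = a1 - a0
    per_sample = np.sum((v_pred - target) ** 2, axis=-1)
    return float(np.sum(weights * per_sample) / np.sum(weights))

def euler_sample(vector_field, a0, n_steps=10):
    """Integrate da/dt = vector_field(a, t) from t=0 to t=1 with Euler steps."""
    a = np.array(a0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        a = a + dt * vector_field(a, k * dt)
    return a

# Toy usage: a field equal to the exact target velocity carries a0 to a1.
rng = np.random.default_rng(0)
a0 = rng.normal(size=(4, 2))           # noise samples
a1 = np.tile([1.0, -2.0], (4, 1))      # stand-in "dataset actions"
sampled = euler_sample(lambda a, t: a1 - a0, a0, n_steps=10)
print(np.allclose(sampled, a1))
```

In practice `vector_field` would be a learned network evaluated at `interpolate(a0, a1, t)` during training; the constant field here only checks that the integration and loss behave as expected.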

Paper Structure

This paper contains 43 sections, 4 theorems, 59 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 4.1

Consider a deterministic SC-MDP with bounded reward and discount factor $\gamma \in (0,1)$. Given a transition $(x_t,a_t,r_t,\ell_t,x_{t+1})$ and the epigraph update $z_{t+1}=(z_t-r(x_t))/\gamma$, the epigraph value function $\hat{V}(x_t, z_t)$ satisfies the recursion
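The epigraph update in the statement above, $z_{t+1}=(z_t-r(x_t))/\gamma$, can be unrolled along a trajectory: writing $r_k$ for the reward at step $k$, after $T$ steps $z_T = \big(z_0 - \sum_{k<T}\gamma^k r_k\big)/\gamma^T$. The Python sketch below covers only this update on $z$, not the value recursion itself. The function name and the sample numbers are illustrative assumptions.

```python
def unroll_epigraph_variable(z0, rewards, gamma):
    """Apply z_{t+1} = (z_t - r_t) / gamma along a reward sequence."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    zs = [float(z0)]
    for r in rewards:
        zs.append((zs[-1] - r) / gamma)
    return zs

# Toy usage: start z at the discounted return of the same reward sequence.
rewards = [1.0, 2.0, 3.0]
gamma = 0.5
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))  # 2.75
zs = unroll_epigraph_variable(discounted_return, rewards, gamma)
print(zs)  # [2.75, 3.5, 3.0, 0.0]
```

When $z_0$ equals the discounted return of the rewards that follow, the unrolled variable reaches exactly zero at the end of the sequence; a larger $z_0$ leaves a positive remainder and a smaller one a negative remainder.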

Figures (8)

  • Figure 1: EpiFlow Framework. We propose a framework that solves the SC-ORL problem by learning an auxiliary value function, $\hat{V}(x,z)$, from an offline dataset. We cast the problem in epigraph form to learn $\hat{V}$; keeping $\hat{V}$ positive ensures that the system stays safe while maximizing the objective. Finally, we learn a policy that maximizes $\hat{V}$ at every time step to find the best path that the offline data supports.
  • Figure 2: Evaluation Results. Results for all environments. Points further left ($\leftarrow$) are safer than those further right ($\rightarrow$), and points further up ($\uparrow$) have higher rewards than those further down ($\downarrow$). Evaluated over 500 episodes and 5 seeds.
  • Figure 3: Boat Environment Analysis. (Top) Trajectories of all methods (baselines and ours) rolled out from 2 distinct initial states. The 2 circles are obstacles that the agent must avoid while trying to reach the goal.
  • Figure 4: Auxiliary epigraph value function $\hat{V}_{\theta}(x,z)$. (Top) Learned without the decomposition regularizer, where $\hat{V}_{\theta}$ exhibits weak sensitivity to the epigraph variable $z$ and similar contour structure across different $z$ values. (Bottom) Learned with the regularizer ($\lambda=0.25$), which restores meaningful dependence on $z$ and yields distinct level sets consistent with the epigraph formulation.
  • Figure 5: Illustration of all evaluation environments. (Top-Left) Boat Navigation with obstacles and a goal point. (Middle) SafeVelocity MuJoCo environments, with the red sphere denoting an unsafe state. (Bottom) SafeCarNavigation MuJoCo environments with various obstacles and objectives.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Theorem 4.1: Epigraph Value Recursion
  • Theorem 5.1: Epigraph-Guided Optimal Policy
  • Theorem 5.2: Weighted Flow Matching Policy Recovery
  • Proposition 1.1: Equivalence of Epigraph Formulation
  • proof
  • proof
  • proof
  • proof