Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning

Janaka Chathuranga Brahmanage; Akshat Kumar

Abstract

Sequential decision making using Markov Decision Processes underpins many real-world applications. Both model-based and model-free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization with safety constraints, which are often conflicting objectives and can lead to unstable min/max adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe state-action set, ensuring that an agent starting inside this set remains safe indefinitely. Yet most reachability-based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this, we first define a safety-conditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min/max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks and a real-world maritime navigation task demonstrate that our method matches or outperforms state-of-the-art baselines while maintaining safety.

Paper Structure

This paper contains 58 sections, 5 theorems, 57 equations, 8 figures, 10 tables, 3 algorithms.

Introduction
Approach overview
Related work
Adaptive Safety Budgets for RL
Preliminaries
Constrained Markov Decision Process
Learning from offline data
In-Sample Q-learning Algorithms
Persistent Safety with Value Functions
Budget-Conditioned Reachability
A Budget-Conditioned Persistent Safety Set
Budget Adaptive MDPs
The objective of the BAMDP
Feasible state space and $\boldsymbol{\Pi_P}$
Augmented State Space:
...and 43 more sections

Key Result

Lemma 3.3

Given $\delta \in {\mathbb{R}}^+$; for any state $s \in S_P(\delta)$, the budget-conditioned persistent safe action set $A_P(s,\delta)$ is non-empty: $A_P(s,\delta) \neq \emptyset.$
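The lemma can be illustrated on a toy tabular problem. The sketch below is not the paper's construction: it assumes, purely for illustration, that the sets are obtained by thresholding an optimal discounted cost-value function, i.e. $S_P(\delta) = \{s : \min_a Q_c^*(s,a) \le \delta\}$ and $A_P(s,\delta) = \{a : Q_c^*(s,a) \le \delta\}$. The three-state MDP, the discount factor, and all function names are hypothetical.

```python
# Toy illustration only. The set definitions below are ASSUMED, not quoted
# from the paper:
#   S_P(delta)    = {s : min_a Q_c(s, a) <= delta}
#   A_P(s, delta) = {a : Q_c(s, a) <= delta}

# Hypothetical 3-state, 2-action deterministic MDP.
# NEXT[s][a] is the successor state, COST[s][a] the immediate safety cost.
NEXT = [[0, 1], [0, 2], [2, 2]]
COST = [[0.0, 1.0], [0.5, 2.0], [1.0, 1.0]]
GAMMA = 0.9  # assumed discount factor


def optimal_cost_q(next_state, cost, gamma, iters=500):
    """Cost-minimizing value iteration; returns the table Q_c[s][a]."""
    n_s, n_a = len(cost), len(cost[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        v = [min(row) for row in q]  # V_c(s) = min_a Q_c(s, a)
        q = [
            [cost[s][a] + gamma * v[next_state[s][a]] for a in range(n_a)]
            for s in range(n_s)
        ]
    return q


def safe_states(q, delta):
    """States whose least achievable cumulative cost fits the budget."""
    return {s for s, row in enumerate(q) if min(row) <= delta}


def safe_actions(q, s, delta):
    """Actions at state s whose cost-to-go fits the budget."""
    return {a for a, q_sa in enumerate(q[s]) if q_sa <= delta}


Q_C = optimal_cost_q(NEXT, COST, GAMMA)

# Under the assumed definitions, every state admitted by the budget keeps
# at least one admissible action (the cost-minimizing one).
for delta in (0.25, 1.0, 2.0, 10.5):
    for s in safe_states(Q_C, delta):
        assert safe_actions(Q_C, s, delta), (s, delta)
```

Under these assumed definitions the non-emptiness is immediate: a state is admitted only if its cost-minimizing action already fits the budget, so that action always belongs to the action set. The paper's own Definition 3.2 and proof may differ in detail.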

Figures (8)

Figure 1: Electronic navigation chart for maritime traffic navigation in the Singapore Strait
Figure 2: State augmentation with budget. A desirable property is for the set $\bar{S}_P$ to be persistent (trajectories starting in $\bar{S}_P$ must end in $\bar{S}_P$)
Figure 2: Performance metrics in the Maritime Navigation Task (Section \ref{exp:real})
Figure 3: Comparison with Optimal Solution: Grid-world results with X-axis as the intended-movement probability $p$ (higher values indicate less noise). Top plots show total return; bottom plots show cost for two budget levels.
Figure 4: Expert and learned trajectories in marine navigation
...and 3 more figures

Theorems & Definitions (16)

Definition 3.1: The optimal cost-value function
Definition 3.2: Budget-Conditioned Persistent Safety Sets
Lemma 3.3: Safe Actions Always Exist for Persistent Safety States
proof
Definition 3.4: Budget-Adaptive MDP
Definition 3.5: Budget-Restricted Policy Set
Definition 3.6: Feasible State Subspace
Definition 3.7: Soft Budget-Tracking
Theorem 3.8: Properties of Policies in $\Pi_P$ under Soft Budget-Tracking
Definition A.1: Direct Budget-Tracking
...and 6 more
