LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

Hsin-Jung Yang, Zhanhong Jiang, Prajwal Koirala, Qisai Liu, Cody Fleming, Soumik Sarkar

TL;DR

This work develops LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. It then extends the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis.

Abstract

Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.
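
To make the lexicographic ordering described in the abstract concrete, one generic way to write a single-cost, safety-first problem is sketched below. This is a textbook-style formulation under assumed notation, not the paper's own definition: $J_c$ and $J_r$ (expected cumulative cost and reward) and the slack $\delta \ge 0$ are symbols introduced here for illustration, while $\pi_\beta$ and $\varepsilon$ follow the behavior-policy constraint mentioned in the Figure 1 caption.

```latex
\begin{aligned}
\Pi_0 &= \{\pi : D_{\mathrm{KL}}(\pi \,\|\, \pi_\beta) \le \varepsilon\}, \\
\Pi_1 &= \{\pi \in \Pi_0 : J_c(\pi) \le \min_{\pi' \in \Pi_0} J_c(\pi') + \delta\}, \\
\pi^\star &\in \arg\max_{\pi \in \Pi_1} J_r(\pi).
\end{aligned}
```

With several costs ordered by priority, the second line would be repeated once per cost, each time restricting the set left by the previous level, so reward is only optimized among policies that are already near-optimal on every safety objective.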

Paper Structure

This paper contains 8 sections, 5 theorems, 20 equations, 3 figures, 3 tables, 2 algorithms.

Key Result

lemma 1

Let Assumption 1 hold and let $\mathcal{F}_\nu$ be a function class of neural networks with VC dimension $\mathrm{VCdim}(\mathcal{F}_\nu)$. For $\hat{Q}_\nu \in \mathcal{F}_\nu$ learned by IQL via empirical Bellman backup using dataset $\mathcal{D}$, with probability at least $1-\varrho$ ($\varrho>0$),

Figures (3)

  • Figure 1: LexiSafe: The agent learns from an offline dataset $\mathcal{D}\sim\pi_\beta$ under a distributional shift constraint $D_{\mathrm{KL}}(\pi \,\|\, \pi_\beta)\leq \varepsilon$. In Stage 1 (phases marked in yellow boxes), the actor network is trained to minimize cumulative costs under constraints that follow the safety hierarchy. In Stage 2 (the last phase), the model is retrained to maximize reward. This enforces a lexicographic policy update, preserving safety while optimizing performance. See Definition 3 for the formula shown in the figure.
  • Figure 2: Ablation study showing LexiSafe's adherence to sequential lexicographic optimization. For both hierarchy orders, LexiSafe proceeds through the intended phases: first minimizing the primary cost, then the secondary cost, and finally improving reward while keeping the constraints satisfied.
  • Figure 3: Comparison of LexiSafe-MC and weighted IQL across different crash-weight values in MetaDrive. LexiSafe-MC satisfies safety constraints while maintaining high reward by explicitly enforcing the user-specified priority order through sequential lexicographic optimization. In contrast, weighted IQL struggles to satisfy constraints reliably using the traditional weighted-sum strategy, highlighting both the practical tuning challenges and the limitations of flat weighting approaches. The bracketed pairs in the legend denote $(w_{\text{crash}}, w_{\text{speed}})$.
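
The contrast drawn in the Figure 3 caption between a priority order and a flat weighted sum can be illustrated with a small toy sketch. It is not the paper's training procedure (which trains an actor in phases, per Figure 1); it only ranks a fixed set of hypothetical candidates. The names `Candidate`, `lexicographic_select`, `weighted_select`, the `slack` parameter, and all numbers are assumptions made up for illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    """Hypothetical policy summary: costs are ordered most important first."""
    name: str
    costs: Tuple[float, ...]
    reward: float


def lexicographic_select(cands: Sequence[Candidate], slack: float = 0.0) -> Candidate:
    """Filter by each cost in priority order, then maximize reward among survivors."""
    survivors = list(cands)
    for level in range(len(survivors[0].costs)):
        best = min(c.costs[level] for c in survivors)
        # Keep only candidates within `slack` of the best value at this level.
        survivors = [c for c in survivors if c.costs[level] <= best + slack]
    return max(survivors, key=lambda c: c.reward)


def weighted_select(cands: Sequence[Candidate], weights: Sequence[float]) -> Candidate:
    """Flat alternative: maximize reward minus a weighted sum of costs."""
    def score(c: Candidate) -> float:
        return c.reward - sum(w * k for w, k in zip(weights, c.costs))
    return max(cands, key=score)


# Illustrative numbers only; costs are (primary, secondary).
candidates = [
    Candidate("fast_risky", (0.30, 0.10), 10.0),
    Candidate("safe_slow", (0.00, 0.20), 4.0),
    Candidate("safe_balanced", (0.00, 0.05), 6.0),
]

lexi = lexicographic_select(candidates)
flat_low = weighted_select(candidates, weights=(5.0, 1.0))
flat_high = weighted_select(candidates, weights=(50.0, 1.0))
print(lexi.name, flat_low.name, flat_high.name)
```

In this toy setup the lexicographic rule returns the same candidate regardless of reward scale, whereas the weighted-sum choice flips between the risky and the safe candidate depending on the primary-cost weight, which is the kind of tuning sensitivity the caption refers to.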

Theorems & Definitions (8)

  • definition 1
  • definition 2
  • lemma 1
  • theorem 1
  • theorem 2
  • theorem 3
  • definition 3
  • corollary 1