A Survey of Constraint Formulations in Safe Reinforcement Learning

Akifumi Wachi; Xun Shen; Yanan Sui

TL;DR

This survey addresses the fragmentation of safe RL research by focusing on constraint formulations within the constrained Markov decision process (CMDP) framework. It formalizes the problem as maximizing $V_r^\pi(\rho)$ subject to a constraint $f_{\mathcal{C}}(\pi)\le 0$ and catalogues seven representative constraint representations. It introduces three theoretical notions (transformability, generalizability, and conservative approximation) and shows that many formulations are IoMG-SafeRL variants of others. For example, an instantaneous (per-step) constraint can be related to a cumulative, long-horizon constraint via a budget transformation $\eta_h$. Key results include that Problem 3.4 is IoMG-SafeRL over (3.1, 3.2) and Problem 3.7 is IoMG-SafeRL over (3.5, 3.6); furthermore, a gamma-corrected instantaneous formulation can conservatively approximate joint chance constraints. The paper provides a curated map of representative algorithms aligned to each formulation, discusses practical considerations for algorithm choice and for safety guarantees during training versus after convergence, and highlights online versus offline safe RL as a practical axis for deployment. Overall, the work offers a systematic understanding of constraint formulations, guides formulation and algorithm selection for real-world safety-critical RL, and sketches avenues for extending safe RL beyond the standard cumulative-additive paradigm.

Abstract

Safety is critical when applying reinforcement learning (RL) to real-world problems. As a result, safe RL has emerged as a fundamental and powerful paradigm for optimizing an agent's policy while incorporating notions of safety. A prevalent safe RL approach is based on a constrained criterion, which seeks to maximize the expected cumulative reward subject to specific safety constraints. Despite recent effort to enhance safety in RL, a systematic understanding of the field remains difficult. This challenge stems from the diversity of constraint representations and little exploration of their interrelations. To bridge this knowledge gap, we present a comprehensive review of representative constraint formulations, along with a curated selection of algorithms designed specifically for each formulation. In addition, we elucidate the theoretical underpinnings that reveal the mathematical mutual relations among common problem formulations. We conclude with a discussion of the current state and future directions of safe reinforcement learning research.
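As a rough illustration of the constrained criterion described above, the following sketch estimates the expected cumulative reward $V_r^\pi(\rho)$ and an expected cumulative safety cost by Monte Carlo, then reports the constraint value $f_{\mathcal{C}}(\pi)$ as estimated cost minus a threshold. The threshold name `xi`, the separate discount factors, the function names, and the toy rollout are all assumptions made for illustration; they are not taken from the paper.

```python
import random


def discounted_sum(values, gamma):
    """Return sum_h gamma^h * values[h]."""
    return sum((gamma ** h) * v for h, v in enumerate(values))


def evaluate_policy(rollout, n_episodes, gamma_r, gamma_c, xi, seed=0):
    """Monte Carlo estimates of V_r, V_c and f_C = V_c - xi.

    `rollout(rng)` is a hypothetical callable that runs one episode under
    the policy and returns (rewards, costs) as two equal-length lists.
    """
    rng = random.Random(seed)
    ret_sum = 0.0
    cost_sum = 0.0
    for _ in range(n_episodes):
        rewards, costs = rollout(rng)
        ret_sum += discounted_sum(rewards, gamma_r)
        cost_sum += discounted_sum(costs, gamma_c)
    v_r = ret_sum / n_episodes
    v_c = cost_sum / n_episodes
    # The policy is (estimated to be) feasible when v_c - xi <= 0.
    return v_r, v_c, v_c - xi


def toy_rollout(rng, horizon=10, p_unsafe=0.1):
    """Toy episode (assumed): reward 1 per step, cost 1 on a random unsafe step."""
    rewards = [1.0] * horizon
    costs = [1.0 if rng.random() < p_unsafe else 0.0 for _ in range(horizon)]
    return rewards, costs


if __name__ == "__main__":
    v_r, v_c, f_c = evaluate_policy(toy_rollout, 1000, 1.0, 1.0, xi=2.0)
    print("V_r estimate:", v_r)
    print("V_c estimate:", v_c)
    print("feasible (f_C <= 0):", f_c <= 0.0)
```

This only evaluates a fixed policy. Solving the constrained problem additionally requires an optimization method, which depends on the formulation chosen.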

Paper Structure (23 sections, 6 theorems, 35 equations, 3 figures, 1 table)

Introduction
Our contributions.
Preliminaries
Common Constraint Formulations
Expected Cumulative Safety Constraint
State Constraint
Joint Chance Constraint
Expected Instantaneous Safety Constraint with Time-variant Threshold
Almost Surely Cumulative Safety Constraint
Almost Surely Instantaneous Safety Constraint with Time-invariant Threshold
Almost Surely Instantaneous Safety Constraint with Time-variant Threshold
Other Constrained Formulations
Theoretical Relations Among Common Constraint Formulations of Safe RL
Definitions
Preliminary Lemmas
...and 8 more sections

Key Result

Lemma 1

Define a new variable $\eta_h$, the remaining safety budget associated with the discount factor $\gamma_c$, such that [equation missing]. Then the following relation between additive and instantaneous constraints holds: [equation missing].
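The equations of Lemma 1 are not reproduced here, so the sketch below does not claim to match the paper's exact statement. It assumes one common way to define a remaining safety budget: start from a threshold `xi` (an assumed name), and at each step subtract the incurred cost and rescale by the discount factor, i.e. $\eta_0 = \xi$ and $\eta_{h+1} = (\eta_h - c_h)/\gamma_c$. Under that assumption, and with non-negative costs, the discounted cumulative cost stays within `xi` exactly when the budget never becomes negative, which is the kind of additive-to-instantaneous relation the lemma describes.

```python
def remaining_budget(costs, xi, gamma_c):
    """Return [eta_0, ..., eta_H] under the ASSUMED recursion
    eta_0 = xi, eta_{h+1} = (eta_h - c_h) / gamma_c."""
    etas = [xi]
    for c in costs:
        etas.append((etas[-1] - c) / gamma_c)
    return etas


def discounted_cost(costs, gamma_c):
    """Return sum_h gamma_c^h * costs[h] (the additive constraint's left side)."""
    return sum((gamma_c ** h) * c for h, c in enumerate(costs))


def additive_ok(costs, xi, gamma_c):
    """Additive (cumulative) form: discounted cost within the threshold."""
    return discounted_cost(costs, gamma_c) <= xi


def instantaneous_ok(costs, xi, gamma_c):
    """Instantaneous form: the remaining budget is non-negative at every step."""
    return all(eta >= 0.0 for eta in remaining_budget(costs, xi, gamma_c))


if __name__ == "__main__":
    costs = [1.0, 0.0, 2.0]  # assumed non-negative per-step costs
    for xi in (2.0, 1.0):
        print(xi, additive_ok(costs, xi, 0.5), instantaneous_ok(costs, xi, 0.5))
```

Unrolling the assumed recursion gives $\eta_h = \gamma_c^{-h}\bigl(\xi - \sum_{h'<h}\gamma_c^{h'} c_{h'}\bigr)$, so the sign of $\eta_h$ tracks whether the partial discounted cost has exceeded the threshold. With floating-point arithmetic the two checks can disagree exactly at the boundary.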

Figures (3)

Figure 1: A typical sequence for solving safe RL problems based on constrained criteria. Due to the diversity of safety constraint representations and little discussion on their interrelations, it is not easy to understand safe RL research systematically. Unlike existing survey papers that focus on methods, we aim to provide a comprehensive survey from the perspective of formulations on safe RL.
Figure 2: Relations among common safe RL formulations based on $\mathbb{E}_\pi$ and the one with chance constraints.
Figure 3: Relations among common safe RL formulations based on $\mathbb{P}_\pi$ (i.e., almost-surely constraints).

Theorems & Definitions (22)

Remark 1
Remark 2
Remark 3
Remark 4
Remark 5
Remark 6
Remark 7
Definition 1: Transformability
Definition 2: Generalizability
Definition 3: Conservative Approximation
...and 12 more
