Safe Value Functions

Pierre-François Massiani, Steve Heim, Friedrich Solowjow, Sebastian Trimpe

TL;DR

This work defines Safe Value Functions (SVFs) as value functions that are simultaneously optimal for a task and guarantee safety by staying within the viability kernel ${\mathcal{X}_V}$. It proves that there exists a finite penalty $p^\star$ on failure such that for all $p>p^\star$, the penalized value function $V_p$ is safe and remains optimal on ${\mathcal{X}_V}$, with larger penalties preserving optimality. The authors derive a zeroth-order safety condition and provide explicit formulas for $p^\star$ in both continuous and discrete time, showing how discounting $\tau$, time-to-failure $T_f$, reward shaping, and dynamics interact. They connect SVFs to Hamilton–Jacobi reachability and control barrier functions, discuss CMDP duality implications, and offer practical reward-design guidelines to achieve safe, task-relevant behavior in reinforcement learning and control settings.

Abstract

Safety constraints and optimality are important, but sometimes conflicting criteria for controllers. Although these criteria are often solved separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions (SVFs): value functions that are both optimal for a given task, and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but upper-unbounded: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal clear structure of how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics to design reward functions for control problems where safety is important.

Paper Structure

This paper contains 39 sections, 8 theorems, 50 equations, 6 figures.

Key Result

Lemma 1

For $x\in\mathcal{X}$ and $u\in\mathfrak{U}$, define the (discounted) risk $\rho$ as: where $\delta_{{\mathcal{X}_F}}$ is the Dirac distribution in continuous time, or the indicator function in discrete time. A controller $u\in\mathfrak{U}$ is safe for $x\in\mathcal{X}$ if, and only if, $\rho(x, u) = 0$.
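The defining formula for $\rho$ is not reproduced above, so the following is only a minimal discrete-time sketch under an assumed form: the risk of a trajectory is taken to be the sum of $\gamma^t$ over every step $t$ at which the state lies in the failure set. The discount factor `gamma`, the function name `discounted_risk`, and the finite (truncated) trajectories are illustrative assumptions, not taken from the paper. The sketch shows the property stated in the lemma: the risk is zero exactly when the trajectory never visits ${\mathcal{X}_F}$.

```python
from typing import Hashable, Sequence, Set


def discounted_risk(
    trajectory: Sequence[Hashable],
    failure_set: Set[Hashable],
    gamma: float = 0.9,
) -> float:
    """Assumed discrete-time risk: sum of gamma**t over steps t spent in failure_set.

    The trajectory is a finite truncation, so a zero result only certifies
    safety over the steps that are actually listed.
    """
    return sum(gamma**t for t, x in enumerate(trajectory) if x in failure_set)


# Hypothetical trajectories over the state labels used in Figure 1.
safe_traj = ["x1", "x1", "x1", "x1"]
unsafe_traj = ["x1", "x1", "x2", "sigma"]

print(discounted_risk(safe_traj, {"x2"}))    # no visit to the failure set
print(discounted_risk(unsafe_traj, {"x2"}))  # one visit, at step t = 2
```

Because every term of the sum is nonnegative, a single visit to the failure set makes the risk strictly positive, which is why a zero risk can serve as a safety certificate.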

Figures (6)

  • Figure 1: The failure set is ${\mathcal{X}_F}=\{x_2\}$, and the viability kernel is ${\mathcal{X}_V}=\{x_1\}$. The system is initialized in $x_1$. Two control inputs are available: $u_1$, which keeps the system in $x_1$, and $u_2$, which transitions it first to $x_2$ and then to the sink state $\sigma$ (in accordance with the remark on failure in infinite time). The reward and cost are both $0$ for $u_1$, and $1$ for $u_2$. The transition to the sink state $\sigma$ is penalized with penalty $p\in\mathbb{R}^+$, which is equivalent to a reward of $-p$. Here, the maximum time-to-failure is ${T_f}=0$, since the agent immediately lands in ${\mathcal{X}_F}$ after leaving ${\mathcal{X}_V}$.
  • Figure 2: (Top) The penalized value function $V_p$ for different values of $p$ with $\tau = 1$. The value on the viability kernel ${\mathcal{X}_V} = [0, 1]$ is not influenced by the penalty when $p \geq p^\star$. For $p > p^\star$, the zeroth-order condition holds and $V_p$ is discontinuous. (Middle) The dependency of $p^\star$ on the discount rate $\tau$ with ${T_f} = 1$. For small values of $\tau$, the dependency $e^{\frac{{T_f}}{\tau}}$ dominates. For large values of $\tau$, the infimum of $V$ on ${\mathcal{X}_V}$ decreases due to the negative rewards in ${\mathcal{X}_V}$, so $p^\star$ increases again. (Bottom) The exponential dependency of $p^\star$ on ${T_f}$, for $\tau = 1$. All plots use $L = 1$ and $v = 0.2$.
  • Figure 3: A parsimonious reward function and penalty (top), and the resulting SVF (bottom). The boundary of the viability kernel is outlined in yellow. In this setting, the positive reward is only propagated to its backwards-reachable set, which in this case is the entire viability kernel ${\mathcal{X}_V}$. Conversely, the negative penalty is only propagated along trajectories that cannot avoid failure; that is, to the complement of ${\mathcal{X}_V}$. As such, any penalty will ensure an SVF. The threshold range is $\alpha \in \left[-0.0054, 0\right]$.
  • Figure 4: A degenerate reward function and penalty (top), and the resulting SVF (bottom). This setting essentially only considers safety, with no notion of task-related optimality. The resulting SVF is strictly negative outside ${\mathcal{X}_V}$ for any positive value of $p$, and the viability kernel ${\mathcal{X}_V}$ can be easily recovered with the threshold $\alpha = 0$. This setting allows the viability kernel to be recovered with a trivial choice of problem parameters. The threshold range is $\alpha \in \left[-0.0054, 0\right]$.
  • Figure 5: A positive reward function and penalty (top), an insufficiently penalized value function (middle), and an SVF (bottom). If a reward function assigns positive rewards outside ${\mathcal{X}_V}$ (top), a larger penalty is required to recover an SVF: with the relatively small penalty of $p=1$, large portions of $\mathcal{X}\setminus {\mathcal{X}_V}$ are already marked with a low value; however, some regions still have a higher value than inside ${\mathcal{X}_V}$. It is only with a penalty of $p=111$, two orders of magnitude greater than the reward, that we recover a value function that is safe everywhere. The threshold range is $\alpha \in \left[-0.004, 0\right]$: we see that, if only positive rewards are used, the viability kernel can still be recovered with a simple threshold at $\alpha=0$.
  • ...and 1 more figure
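
The two-state example of Figure 1 can be sketched as a small discrete-time value iteration. This is an illustration under stated assumptions, not the paper's computation: the discount factor `gamma`, the function name `solve_toy_mdp`, and the convention that the penalty $-p$ is received one step after choosing $u_2$ (on the transition from $x_2$ to $\sigma$) are all hypothetical choices. Under these assumptions, leaving the viability kernel is worth $1 - \gamma p$, so staying in $x_1$ becomes strictly preferable once $p > 1/\gamma$.

```python
def solve_toy_mdp(p: float, gamma: float = 0.9, n_iter: int = 500) -> dict:
    """Value iteration on the Figure 1 system under the assumptions above.

    x1: u1 -> x1 (reward 0), u2 -> x2 (reward 1)
    x2: forced transition to the sink sigma (reward -p)
    sigma: absorbing, reward 0, so V(sigma) = 0
    """
    v_x2 = -p                      # penalty, followed by V(sigma) = 0
    q_u2 = 1.0 + gamma * v_x2      # value of leaving the viability kernel
    v_x1 = 0.0
    for _ in range(n_iter):
        q_u1 = 0.0 + gamma * v_x1  # value of staying in x1
        v_x1 = max(q_u1, q_u2)     # Bellman optimality backup
    q_u1 = gamma * v_x1
    greedy = "u1" if q_u1 > q_u2 else "u2"
    return {"V(x1)": v_x1, "Q(x1,u1)": q_u1, "Q(x1,u2)": q_u2, "greedy": greedy}


for penalty in (1.0, 2.0, 100.0):
    print(penalty, solve_toy_mdp(penalty))
```

With `gamma = 0.9`, the penalty `p = 1.0` is too small and the greedy action is the unsafe `u2`, while `p = 2.0` and `p = 100.0` both select `u1` and yield the same value on $x_1$. This matches the qualitative claim in the TL;DR that increasing the penalty past the threshold leaves the value on ${\mathcal{X}_V}$ unchanged.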

Theorems & Definitions (17)

  • Definition 1: Viability kernel
  • Remark 1
  • Definition 2: Safe controller
  • Lemma 1
  • Definition 3: Safe value function
  • Proposition 1
  • Theorem 1: Zeroth-order condition
  • Lemma 2: Influence of the penalty
  • Theorem 2
  • Remark 2
  • ...and 7 more