Q-learning-based Model-free Safety Filter

Guo Ning Sue, Yogita Choudhary, Richard Desatnik, Carmel Majidi, John Dolan, Guanya Shi

TL;DR

This work tackles safety in real-world robotics with unknown dynamics by introducing a model-free safety filter learned via Q-learning. A novel reward design $r_{\text{safe}}$ partitions the state space into $\mathcal{X}_{\text{safe}}$, $\mathcal{X}_{\text{irrec}}$, and $\mathcal{X}_{\text{unsafe}}$, enabling a safety value function $V_{\text{safe}}^*$ and Q-function $Q_{\text{safe}}^*$ to guide action filtering through a threshold $\epsilon_2$. The method supports simultaneous, off-policy training of a task policy and a safety policy with separate replay buffers, and includes an implementation that uses SAC for simulations and DQN for hardware. Empirical validation on a double integrator, Dubin's car, and a soft robotic limb demonstrates the framework’s effectiveness, robustness to suboptimal training, and ability to generalize to different task policies. The approach offers a practical, plug-and-play safety mechanism for model-free RL in complex robotic systems, with tunable conservatism via $\epsilon_2$ and potential for broader applicability beyond the tested domains.

Abstract

Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics are complex or unavailable. To handle this issue, learning-based safety filters, which can be classified as model-based or model-free, have recently gained popularity. Existing model-based approaches require various assumptions on the system model (e.g., control-affine dynamics), which limits their application to complex systems, while existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plug-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions that safeguard arbitrary task-specific nominal policies by filtering out their potentially unsafe actions. The threshold used in the filtering process is supported by our theoretical analysis. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double integrator and Dubin's car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.
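The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `filter_action`, the finite candidate set `ACTIONS`, the fallback rule (pick the candidate with the highest safety Q-value), and the toy function `q_toy` are all assumptions made for illustration. The source only states that the nominal action is filtered using the safety agent's Q-function and a threshold $\epsilon_2$.

```python
from typing import Callable, Sequence


def filter_action(
    state: float,
    nominal_action: float,
    q_safe: Callable[[float, float], float],
    candidate_actions: Sequence[float],
    epsilon_2: float,
) -> float:
    """Keep the nominal action if its safety Q-value clears the threshold.

    Otherwise fall back to the candidate action with the highest safety
    Q-value (an assumed fallback rule, not taken from the source).
    """
    if q_safe(state, nominal_action) >= epsilon_2:
        return nominal_action
    return max(candidate_actions, key=lambda u: q_safe(state, u))


def q_toy(x: float, u: float) -> float:
    """Hypothetical stand-in for a learned safety Q-function.

    The toy system is x' = x + u and states with |x| > 1 are unsafe, so the
    value is the signed margin of the successor state to the unsafe boundary.
    """
    return 1.0 - abs(x + u)


ACTIONS = (-0.5, 0.0, 0.5)   # hypothetical discrete action set
EPSILON_2 = 0.0              # filtering threshold

far_from_boundary = filter_action(0.0, 0.5, q_toy, ACTIONS, EPSILON_2)
near_boundary = filter_action(0.8, 0.5, q_toy, ACTIONS, EPSILON_2)
print(far_from_boundary, near_boundary)
```

In this toy setting the nominal action passes through unchanged far from the boundary and is overridden near it. With a continuous-action safety agent (e.g., SAC), the fallback would more plausibly be the safety policy's own action rather than a maximum over a finite set.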

Paper Structure

This paper contains 23 sections, 1 theorem, 19 equations, 6 figures, 2 tables.

Key Result

Theorem 1

Given $Q^*_{safe}$, if $x \in \mathcal{X}_{safe}$ and the filtered policy $\pi_{filter}(x)$ is followed, then $x' \in \mathcal{X}_{safe}$, where $x'$ is the state of the system after taking $\pi_{filter}(x)$ from $x$.
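The invariance claim in Theorem 1 can be checked numerically on a toy problem. The sketch below is only a sanity check under stated assumptions, not a proof and not the paper's experiment: the discrete-time double integrator, the one-sided constraint `P_MAX`, the action set, and the hand-coded `q_stand_in` (which replaces the learned $Q^*_{safe}$ with a sign indicator of whether the successor state can still brake in time) are all hypothetical choices.

```python
DT = 0.1                      # integration step (assumed)
P_MAX = 1.0                   # positions above this value are unsafe (assumed)
ACTIONS = (-1.0, 0.0, 1.0)    # admissible accelerations (assumed)
EPSILON_2 = 0.0               # filtering threshold


def step(p, v, u):
    """One explicit-Euler step of the double integrator."""
    return p + v * DT, v + u * DT


def is_safe(p, v):
    """True if braking at full effort keeps the position at or below P_MAX."""
    while v > 0.0:
        if p > P_MAX:
            return False
        p, v = step(p, v, -1.0)
    return p <= P_MAX


def q_stand_in(p, v, u):
    """Hand-coded stand-in for Q*_safe: +1 if the successor is safe, else -1."""
    return 1.0 if is_safe(*step(p, v, u)) else -1.0


def pi_filter(p, v, u_nominal):
    """Pass the nominal action through unless its stand-in Q-value is too low."""
    if q_stand_in(p, v, u_nominal) >= EPSILON_2:
        return u_nominal
    return max(ACTIONS, key=lambda u: q_stand_in(p, v, u))


state = (0.0, 0.0)            # starts inside the safe set
trajectory, applied = [state], []
for _ in range(200):
    u = pi_filter(*state, 1.0)   # nominal policy: always accelerate
    state = step(*state, u)
    applied.append(u)
    trajectory.append(state)

print("all visited states safe:", all(is_safe(p, v) for p, v in trajectory))
print("filter intervened:", any(u != 1.0 for u in applied))
```

Because full braking from any safe state leads to another safe state in this toy model, the fallback always finds an action with a non-negative stand-in value, which mirrors the role $\pi_{filter}$ plays in the theorem.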

Figures (6)

  • Figure 1: The block diagram shows our model-free RL-based safety filter framework. During training, environment observations are stored in the replay buffers of both the task-specific nominal agent and the safety agent, enabling their simultaneous training. In testing, observations are processed by both policies, and the nominal action is filtered based on the safety agent's Q-function and threshold $\epsilon_2$.
  • Figure 2: The full state space is divided into three regions: $\mathcal{X}_{safe}$, $\mathcal{X}_{irrec}$, and $\mathcal{X}_{unsafe}$. $\mathcal{X}_{safe}$ is the region where a control input $u$ always exists that prevents the system from entering $\mathcal{X}_{unsafe}$. $\mathcal{X}_{irrec}$ is the region where no control input can prevent entry into $\mathcal{X}_{unsafe}$. For example, a car that is moving too fast and is too close to the unsafe region is in the irrecoverable region.
  • Figure 3: A one-dimensional conceptual visualization of what $V^*_{safe}$ could look like. $\epsilon_2=0$ is the threshold value that separates $\mathcal{X}_{\text{safe}}$ and $\mathcal{X}_{\text{irrec}}$. Because $V^*_{safe}$ increases deeper inside $\mathcal{X}_{\text{safe}}$, picking a threshold value $\hat{\epsilon}_2$ that is higher than the optimal $\epsilon_2$ gives a more conservative estimate of the safe set.
  • Figure 4: Left: The actions of the safe policy as a function of the state. Right: The value function of the safety agent for the double integrator. The dark shaded area is the unsafe region.
  • Figure 5: Left: Performance of our filter evaluated by average episodic return and safety rate at different $\epsilon_2$ levels in the Dubin's car environment. Right: Real-life limb trajectory overlaid on the learned value function. The value function is a 2D projection (with velocities set to zero) of a 4D function learned through simulation, while trajectory values are recorded in real life. Dark dashed lines denote the 0 and 90 threshold contours.
  • ...and 1 more figure
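The conservatism argument in the Figure 3 caption can be made concrete with a small Python sketch. Everything here is hypothetical: `v_safe_toy` is a hand-written one-dimensional value function that grows toward the center of the safe region, the grid and the raised threshold of 0.25 are arbitrary, and including states exactly at the threshold is an assumed convention. The point is only that raising the threshold above $\epsilon_2 = 0$ yields a subset of the original estimated safe set.

```python
def v_safe_toy(x: float) -> float:
    """Hypothetical 1D safety value: non-negative for |x| <= 1, largest at 0."""
    return 1.0 - abs(x)


def estimated_safe_set(grid, epsilon_2):
    """Grid points whose safety value meets or exceeds the threshold."""
    return [x for x in grid if v_safe_toy(x) >= epsilon_2]


GRID = [i / 10 for i in range(-15, 16)]          # sample states in [-1.5, 1.5]

baseline = estimated_safe_set(GRID, 0.0)         # threshold at epsilon_2 = 0
conservative = estimated_safe_set(GRID, 0.25)    # raised threshold (assumed)

print("baseline extent:", min(baseline), max(baseline))
print("conservative extent:", min(conservative), max(conservative))
```

A learned value function carries approximation error, so a raised threshold trades some task performance for a margin against that error, which is the trade-off the Figure 5 caption describes across $\epsilon_2$ levels.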

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • proof