Q-learning-based Model-free Safety Filter
Guo Ning Sue, Yogita Choudhary, Richard Desatnik, Carmel Majidi, John Dolan, Guanya Shi
TL;DR
This work tackles safety in real-world robotics with unknown dynamics by introducing a model-free safety filter learned via Q-learning. A novel reward design $r_{\text{safe}}$ partitions the state space into $\mathcal{X}_{\text{safe}}$, $\mathcal{X}_{\text{irrec}}$, and $\mathcal{X}_{\text{unsafe}}$, which lets a safety value function $V_{\text{safe}}^*$ and Q-function $Q_{\text{safe}}^*$ guide action filtering through a threshold $\epsilon_2$. The method supports simultaneous, off-policy training of a task policy and a safety policy with separate replay buffers, and its implementation uses SAC for simulations and DQN for hardware. Empirical validation on a double integrator, a Dubins car, and a soft robotic limb demonstrates the framework's effectiveness, robustness to suboptimal training, and ability to generalize to different task policies. The approach offers a practical, plug-and-play safety mechanism for model-free RL in complex robotic systems, with conservatism tunable via $\epsilon_2$ and potential for broader applicability beyond the tested domains.
Abstract
Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics are complex or unavailable. To handle this issue, learning-based safety filters have recently gained popularity; they can be classified as model-based or model-free methods. Existing model-based approaches require various assumptions on the system model (e.g., control-affine), which limits their application in complex systems, while existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plug-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions that safeguard arbitrary task-specific nominal policies by filtering out their potentially unsafe actions. The threshold used in the filtering process is supported by our theoretical analysis. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double integrator and Dubins car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.
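
The filtering step described above can be illustrated with a minimal sketch for a discrete action space. It is not the paper's implementation: the function name `filter_action`, the convention that a larger safety Q-value means a safer action, the `>=` comparison against the threshold, and the greedy fallback to the action with the highest safety Q-value are all assumptions made for illustration. The argument `epsilon` stands in for the threshold $\epsilon_2$.

```python
import numpy as np


def filter_action(q_safe, nominal_action, epsilon):
    """Hypothetical safety filter over a discrete action set.

    q_safe         : 1-D array, learned safety Q-value of each action in the
                     current state (assumed: larger means safer).
    nominal_action : index of the action proposed by the task policy.
    epsilon        : filtering threshold (stands in for epsilon_2).

    Returns (action_index, overridden).
    """
    q_safe = np.asarray(q_safe, dtype=float)

    # Assumed rule: keep the nominal action if its safety value clears the threshold.
    if q_safe[nominal_action] >= epsilon:
        return int(nominal_action), False

    # Otherwise fall back to the action the safety Q-function rates highest.
    return int(np.argmax(q_safe)), True


# Example with made-up safety values for three actions.
q_values = np.array([0.1, 0.9, 0.4])
kept = filter_action(q_values, nominal_action=2, epsilon=0.3)
replaced = filter_action(q_values, nominal_action=0, epsilon=0.3)
```

In this sketch the filter only needs the safety Q-values and the proposed action, so it can wrap any task policy without modifying it. Under the assumed convention, raising `epsilon` makes the filter override the nominal policy more often.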
