A Safety Modulator Actor-Critic Method in Model-Free Safe Reinforcement Learning and Application in UAV Hovering
Qihan Qi, Xinsong Yang, Gang Xia, Daniel W. C. Ho, Pengyang Tang
TL;DR
This work addresses safe reinforcement learning for UAV hover tasks by introducing SMAC, a model-free safety modulator actor-critic method. The core idea is to separate reward maximization from safety enforcement: a safety modulator minimally perturbs a risky action so that it satisfies the safety constraints, while a distributional critic mitigates Q-value overestimation under those constraints. The framework combines KL-divergence-based distributional policy evaluation with a dual-critic setup and derives explicit gradient updates for both the risky policy and the safety modulator. Experiments in PyBullet simulation and in real-world UAV hovering show that SMAC maintains the safety constraints while achieving higher returns than the baselines, and that the learned policy transfers from simulation to the real vehicle.
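To make the idea of a minimally perturbing safety modulator concrete, here is a minimal sketch. It assumes the modulation is additive and that the action is bounded in [-1, 1]. The function names, the penalty form, and `penalty_weight` are hypothetical illustrations, not the paper's actual update rule.

```python
import numpy as np


def modulate_action(risky_action, modulation, low=-1.0, high=1.0):
    """Combine a risky action with a correction from the safety modulator.

    Assumption: the modulation is additive and the result is clipped to
    the actuator bounds [low, high].
    """
    risky_action = np.asarray(risky_action, dtype=float)
    modulation = np.asarray(modulation, dtype=float)
    return np.clip(risky_action + modulation, low, high)


def modulator_loss(modulation, predicted_cost, cost_limit, penalty_weight=0.1):
    """Hypothetical modulator objective, for illustration only.

    It penalizes the predicted constraint violation plus the squared size
    of the correction, so that smaller corrections are preferred whenever
    the constraint is already satisfied.
    """
    violation = max(0.0, predicted_cost - cost_limit)
    size_penalty = penalty_weight * float(np.sum(np.square(modulation)))
    return violation + size_penalty


if __name__ == "__main__":
    risky = [0.9, -0.2]    # action proposed by the reward-seeking policy
    delta = [-0.3, 0.0]    # correction proposed by the safety modulator
    print(modulate_action(risky, delta))
    print(modulator_loss(np.asarray(delta), predicted_cost=1.2, cost_limit=1.0))
```

Under these assumptions, the reward-seeking policy never sees the constraint directly. Only the modulator trades off predicted cost against the size of its correction.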
Abstract
This paper proposes a safety modulator actor-critic (SMAC) method that addresses safety constraints and mitigates overestimation in model-free safe reinforcement learning (RL). A safety modulator is developed to satisfy safety constraints by modulating actions, allowing the policy to ignore the safety constraints and focus on maximizing reward. Additionally, a distributional critic with a theoretical update rule for SMAC is proposed to mitigate the overestimation of Q-values under safety constraints. Experiments on Unmanned Aerial Vehicle (UAV) hovering, in both simulation and real-world scenarios, confirm that SMAC effectively maintains safety constraints and outperforms mainstream baseline algorithms.
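As an illustration of what a KL-divergence-based distributional critic objective can look like, the sketch below computes the closed-form KL divergence between two univariate Gaussians. Modeling the return distribution as Gaussian is an assumption made here for the example. The function name `gaussian_kl` and the choice of target versus prediction are likewise hypothetical, not taken from the paper.

```python
import math


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL( N(mean_p, std_p^2) || N(mean_q, std_q^2) ) in closed form.

    Assumption: in a distributional critic, p would play the role of the
    target return distribution and q the critic's current prediction.
    """
    if std_p <= 0 or std_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
        - 0.5
    )


if __name__ == "__main__":
    # Identical distributions have zero divergence.
    print(gaussian_kl(0.0, 1.0, 0.0, 1.0))
    # A prediction whose mean is shifted away from the target is penalized.
    print(gaussian_kl(1.0, 1.0, 0.0, 1.0))
```

Minimizing such a divergence fits both the mean and the spread of the predicted return. This is one common way distributional critics are trained, and it may differ in detail from the update rule derived for SMAC.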
