A Safety Modulator Actor-Critic Method in Model-Free Safe Reinforcement Learning and Application in UAV Hovering

Qihan Qi, Xinsong Yang, Gang Xia, Daniel W. C. Ho, Pengyang Tang

TL;DR

This work addresses safe reinforcement learning for UAV hovering tasks by introducing SMAC, a model-free safety modulator actor-critic method. The core idea is to separate reward maximization from safety enforcement through a safety modulator that minimally perturbs a risky action, while a distributional critic mitigates Q-value overestimation under safety constraints. The framework combines a KL-divergence-based distributional policy evaluation with a dual-critic setup and derives explicit gradient updates for both the risky policy and the safety modulator. Experiments in PyBullet simulations and real-world UAV hovering show that SMAC maintains safety constraints while achieving higher returns than baselines and transfers effectively from simulation to the real vehicle, indicating practical safety and performance gains for model-free safe RL in UAV applications.

Abstract

This paper proposes a safety modulator actor-critic (SMAC) method that addresses both safety-constraint satisfaction and overestimation mitigation in model-free safe reinforcement learning (RL). A safety modulator is developed to satisfy safety constraints by modulating actions, allowing the policy to ignore the safety constraints and focus on maximizing reward. Additionally, a distributional critic with a theoretical update rule for SMAC is proposed to mitigate the overestimation of Q-values under safety constraints. Experiments on unmanned aerial vehicle (UAV) hovering in both simulation and real-world scenarios confirm that SMAC effectively maintains safety constraints and outperforms mainstream baseline algorithms.
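
The action-modulation idea described above can be illustrated with a minimal sketch. It assumes the modulation is additive (risky action plus a correction) and that actions are clipped to a normalized range; the function name `modulate_action`, the bounds, and the example numbers are all hypothetical and are not taken from the paper.

```python
import numpy as np

def modulate_action(risky_action, modulation, low=-1.0, high=1.0):
    """Combine a risky action with a safety correction.

    Assumption: the modulation is additive and the result is clipped to
    [low, high]. The paper's exact rule may differ.
    """
    risky_action = np.asarray(risky_action, dtype=float)
    modulation = np.asarray(modulation, dtype=float)
    return np.clip(risky_action + modulation, low, high)

# Hypothetical 4-dimensional motor command and correction (made-up numbers).
risky = np.array([0.9, 0.2, -0.5, 0.7])   # output of the reward-seeking policy
delta = np.array([-0.3, 0.0, 0.1, 0.5])   # output of the safety modulator
safe = modulate_action(risky, delta)      # action that would be executed
```

In this sketch the reward-seeking policy never sees the constraint; only the correction term is responsible for keeping the executed action safe, which mirrors the separation of roles described in the abstract.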

Paper Structure

This paper contains 18 sections, 28 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: The framework graph features nodes representing variables and edges representing operations. Orange paths represent the gradient paths of $\theta_{\bar{u}}$, while purple paths represent the gradient paths of $\theta_{\Delta}$. Paths depicted in black or orange are detached for $\theta_{\Delta}$, and paths depicted in black or purple are detached for $\theta_{\bar{u}}$.
  • Figure 2: The Crazyflie 2.1 in PyBullet.
  • Figure 3: The average return training curves of SAC, SAC-Lag, and SMAC over 5 runs. The lines and the shaded areas represent the average return and the 95% confidence interval, respectively.
  • Figure 4: The average cost training curves of SAC, SAC-Lag, and SMAC over 5 runs. The red dashed line is the safety constraint $C=50$.
  • Figure 5: The true average Q-value (solid lines) and estimated average Q-value (dashed lines) training curves over 5 runs, taken at the 500th step of each episode.
  • ...and 2 more figures
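
The captions for Figures 3 and 4 report a mean curve with a 95% confidence interval over 5 runs. The source does not state how the interval is computed; the sketch below assumes a Student-t interval on the per-point mean across runs, with the critical value for 4 degrees of freedom hard-coded. The function name `mean_and_ci` and the sample data are hypothetical.

```python
import numpy as np

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 runs).
T_CRIT_DF4 = 2.776

def mean_and_ci(values_per_run, t_crit=T_CRIT_DF4):
    """Return (mean, lower, upper) along the run axis.

    values_per_run has shape (n_runs, n_points). The t-interval is an
    assumption; the paper may use a different estimator.
    """
    runs = np.asarray(values_per_run, dtype=float)
    mean = runs.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation.
    sem = runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width

# Made-up returns: 5 runs, 2 evaluation points each.
runs = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0], [5.0, 6.0]])
mean, lower, upper = mean_and_ci(runs)
```

The `lower` and `upper` arrays correspond to the edges of a shaded band around the mean curve.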

Theorems & Definitions (6)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6