
A Dynamic Safety Shield for Safe and Efficient Reinforcement Learning of Navigation Tasks

Murad Dawood, Ahmed Shokry, Maren Bennewitz

TL;DR

The paper tackles safe reinforcement learning for robotic navigation, addressing safety during training and the sim-to-real gap. It introduces a dynamic safety shield that merges a model predictive control (MPC) shield with a supervisor reinforcement learning agent to online-tune obstacle-avoidance weights and align shield actions with the task agent. The supervisor operates without goal information to minimize exploration overhead and is trained with a reward that penalizes collisions and action divergence, while the task agent learns goal-directed navigation. Across three simulated environments and a real-world robot, the approach achieves the best goals-to-collisions ratios, often with near-zero collisions, outperforming constrained RL baselines and classic MPC shields, thereby validating the practicality of integrating long-horizon predictions with learned safety control for safe and efficient navigation.
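The supervisor's reward is described above only qualitatively: it penalizes collisions and divergence between the shield's action and the task agent's action. A minimal sketch of such a reward follows. The function name, the squared-difference divergence measure, and the weights `w_collision` and `w_divergence` are all illustrative assumptions; the paper's exact reward is not reproduced here.

```python
def supervisor_reward(collided, a_task, a_shield,
                      w_collision=10.0, w_divergence=1.0):
    """Hypothetical supervisor reward: always <= 0, with no goal term.

    collided : bool, whether a collision occurred at this step
    a_task   : (v, w) action proposed by the task agent
    a_shield : (v, w) action actually applied by the shield
    The weights are assumed values, not taken from the paper.
    """
    # Squared Euclidean distance between the two actions (assumed measure).
    divergence = sum((t - s) ** 2 for t, s in zip(a_task, a_shield))
    return -(w_collision * float(collided) + w_divergence * divergence)


# The reward is zero when the shield passes the task action through safely,
# and becomes more negative as the shield overrides the agent or a collision occurs.
r_pass = supervisor_reward(False, (0.5, 0.1), (0.5, 0.1))
r_override = supervisor_reward(False, (0.5, 0.0), (0.0, 0.0))
r_crash = supervisor_reward(True, (0.5, 0.1), (0.5, 0.1))
```

Because no goal term appears in this reward, the supervisor in this sketch has no incentive to explore toward goals, which mirrors the design choice described in the summary.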

Abstract

Reinforcement learning (RL) has been successfully applied to a variety of robotics applications, where it outperforms classical methods. However, the safety of RL and its transfer to the real world remain open challenges. A prominent field for tackling these challenges and ensuring the safety of agents during training and execution is safe reinforcement learning. Safe RL can be achieved through constrained RL and safe exploration approaches. The former learns the safety constraints over the course of training to achieve safe behavior by the end of training, at the cost of a high number of collisions in the earlier stages of training. The latter offers robust safety by enforcing the safety constraints as hard constraints, which prevents collisions but hinders the exploration of the RL agent, resulting in lower rewards and poor performance. To overcome these drawbacks, we propose a novel safety shield that combines the robustness of optimization-based controllers with the long prediction capabilities of RL agents, allowing the RL agent to adaptively tune the parameters of the controller. Our approach improves the exploration of RL agents for navigation tasks while minimizing the number of collisions. Experiments in simulation show that our approach outperforms state-of-the-art baselines in the reached goals-to-collisions ratio in different challenging environments. The goals-to-collisions ratio metric emphasizes the importance of minimizing the number of collisions while learning to accomplish the task. Our approach achieves a higher number of reached goals than classic safety shields and fewer collisions than constrained RL approaches. Finally, we demonstrate the performance of the proposed method in a real-world experiment.
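The goals-to-collisions ratio can be illustrated with a short sketch. The function name and, in particular, the handling of zero collisions (clamping the denominator to 1) are assumptions made here for illustration; the abstract does not state how the zero-collision case is treated.

```python
def goals_to_collisions_ratio(goals_reached, collisions):
    """Ratio of reached goals to collisions over a training run.

    Assumption: the denominator is clamped to at least 1 so that a run
    with zero collisions yields a finite value instead of a division by zero.
    """
    if goals_reached < 0 or collisions < 0:
        raise ValueError("counts must be non-negative")
    return goals_reached / max(collisions, 1)


# Two runs with the same number of reached goals: the run with fewer
# collisions scores higher, which is the behavior the metric rewards.
ratio_unsafe = goals_to_collisions_ratio(50, 25)
ratio_safe = goals_to_collisions_ratio(50, 5)
```

Under this sketch, an agent that reaches many goals but collides often scores no better than a cautious agent that reaches fewer goals with almost no collisions, which is why the metric favors approaches that are both safe and effective.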

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures.

Figures (6)

  • Figure 1: Architecture of our approach. The task agent (green) is responsible for learning the navigation task. It receives $State_{Task}$ from the environment and outputs the linear and angular velocities ($v_{RL}, w_{RL}$). The supervisor agent (red) receives $State_{Supervisor}$ from the environment and outputs the weights for aligning the MPC shield's actions with the task agent's actions, as well as the weights of the constraints. The MPC shield solves the optimal control problem (Eq. \ref{eq:mpc_mod}) using all of these weights to find the safe actions $v_{MPC}, w_{MPC}$.
  • Figure 2: Environments used in the experiments. (a) Environment with five pillars (blue). (b) Environment with six pillars and six L-shaped walls (blue). (c) Environment with eight pillars and eight L-shaped walls. All obstacles are placed randomly at the beginning of each episode.
  • Figure 3: Results for all the approaches in the three environments. The bold lines show the average over three random seeds, while the shaded areas show the standard deviation over the runs. Our approach consistently achieves the highest goals-to-collisions ratio, which shows that it is able to safely guide the task agent to learn the navigation task with few collisions. Note that some baselines reach up to 200 collisions, while our approach achieves near-zero collisions, overlapping with MPC_TUNED at zero, as can be seen in the second row.
  • Figure 4: Results for the ablation study over three random seeds. Introducing the goal information to the supervisor agent results in more collisions, as the supervisor agent explores to reach more goals in order to maximize its rewards. The goals-to-collisions ratio for our approach without the goal information is higher than for the agent with the goal information. This shows that including the goal information for the supervisor agent does not improve the performance of the task agent.
  • Figure : Robot Trajectory
  • ...and 1 more figure
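
The data flow in the Figure 1 caption (task action in, supervisor weights in, safe action out) can be sketched with a toy stand-in for the shield. This is not the paper's MPC formulation: the function `toy_shield`, the one-step unicycle prediction, the discrete candidate set, and the inverse-distance obstacle cost are all assumptions chosen to keep the example self-contained. Only the roles of the alignment weight and the obstacle weight follow the caption.

```python
import math


def toy_shield(a_rl, obstacle, pose, w_align, w_obs, dt=0.5):
    """Pick the candidate (v, w) with the lowest weighted cost.

    a_rl     : (v_RL, w_RL) proposed by the task agent
    obstacle : (x, y) position of a single obstacle (assumed point obstacle)
    pose     : (x, y, theta) current robot pose
    w_align  : weight on staying close to the task agent's action
    w_obs    : weight on obstacle avoidance (would come from the supervisor)
    """
    x, y, theta = pose
    best, best_cost = None, math.inf
    for v in (0.0, 0.25, 0.5):
        for w in (-1.0, -0.5, 0.0, 0.5, 1.0):
            # One-step unicycle prediction (a real MPC predicts a full horizon).
            th = theta + w * dt
            nx = x + v * math.cos(th) * dt
            ny = y + v * math.sin(th) * dt
            dist = math.hypot(nx - obstacle[0], ny - obstacle[1])
            align = (v - a_rl[0]) ** 2 + (w - a_rl[1]) ** 2
            cost = w_align * align + w_obs / (dist + 1e-6)
            if cost < best_cost:
                best, best_cost = (v, w), cost
    return best


# With no obstacle weight the shield simply passes the task action through;
# with a large obstacle weight it deviates from an action that heads into the obstacle.
passthrough = toy_shield((0.5, 0.0), (0.3, 0.0), (0.0, 0.0, 0.0), w_align=1.0, w_obs=0.0)
avoiding = toy_shield((0.5, 0.0), (0.3, 0.0), (0.0, 0.0, 0.0), w_align=1.0, w_obs=10.0)
```

In this sketch the two weights are fixed arguments; in the architecture described by the caption they would be produced at every step by the supervisor agent from $State_{Supervisor}$.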