Reinforcement Learning in a Safety-Embedded MDP with Trajectory Optimization

Fan Yang, Wenxuan Zhou, Zuxin Liu, Ding Zhao, David Held

TL;DR

This work tackles safety in reinforcement learning for robotics by formulating a safety-constrained problem within a Constrained MDP and proposing a hierarchical Safety-Embedded MDP (SEMDP) that integrates a trajectory optimizer. The high-level RL agent outputs subgoals which are translated into safe trajectories by the optimizer, while a trajectory-following module executes the plan; this decouples reward optimization from safety enforcement and enables stable, long-horizon policy learning. Empirical results on Safety Gym Push tasks and a real-robot box-pushing task show significantly higher rewards with near-zero safety violations compared to baselines, including robust performance under perception noise and across multiple robot morphologies. The approach demonstrates that embedding optimization-based safety into the transition dynamics can yield superior, practically relevant safety and performance gains, with potential extensions to dynamic obstacles and manipulation tasks.

Abstract

Safe Reinforcement Learning (RL) plays an important role in applying RL algorithms to safety-critical real-world applications, addressing the trade-off between maximizing rewards and adhering to safety constraints. This work introduces a novel approach that combines RL with trajectory optimization to manage this trade-off effectively. Our approach embeds safety constraints within the action space of a modified Markov Decision Process (MDP). The RL agent produces a sequence of actions that are transformed into safe trajectories by a trajectory optimizer, thereby effectively ensuring safety and increasing training stability. This novel approach excels in its performance on challenging Safety Gym tasks, achieving significantly higher rewards and near-zero safety violations during inference. The method's real-world applicability is demonstrated through a safe and effective deployment in a real robot task of box-pushing around obstacles.
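
The hierarchy described above (a high-level agent proposes a subgoal, a trajectory optimizer turns it into a safe trajectory, and a follower executes it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2D point agent, the circular hazards, the `MARGIN` constant, and the function names are hypothetical. The "optimizer" is a simple projection of straight-line waypoints out of hazard discs, not the paper's solver; the follower is idealized and reaches each waypoint exactly; and the high-level policy is a placeholder that returns the goal instead of a learned RL policy.

```python
import numpy as np

MARGIN = 0.05  # hypothetical clearance added when pushing waypoints out of a hazard


def plan_safe_trajectory(pos, subgoal, hazards, radius, n_waypoints=10):
    """Toy stand-in for the trajectory optimizer (not the paper's solver):
    straight-line waypoints, each projected out of any hazard disc it falls in."""
    # Evenly spaced waypoints from the current position (excluded) to the subgoal.
    waypoints = np.linspace(pos, subgoal, n_waypoints + 1)[1:]
    for center in hazards:
        offset = waypoints - center
        dist = np.linalg.norm(offset, axis=1, keepdims=True)
        # Use a fixed direction if a waypoint sits exactly on the hazard center.
        direction = np.where(dist > 1e-9, offset / np.maximum(dist, 1e-9), [1.0, 0.0])
        projected = center + (radius + MARGIN) * direction
        waypoints = np.where(dist < radius, projected, waypoints)
    return waypoints


def semdp_step(pos, subgoal, goal, hazards, radius, k=3):
    """One transition of the toy safety-embedded MDP: plan toward the subgoal,
    follow the first k waypoints, and return (new position, reward, cost)."""
    waypoints = plan_safe_trajectory(pos, subgoal, hazards, radius)
    reward, cost = 0.0, 0
    for wp in waypoints[:k]:
        # Reward is progress toward the task goal (an assumed reward shape).
        reward += np.linalg.norm(pos - goal) - np.linalg.norm(wp - goal)
        pos = wp  # idealized follower: reaches the waypoint exactly
        # Cost counts low-level steps spent inside any hazard.
        cost += int(any(np.linalg.norm(pos - c) < radius for c in hazards))
    return pos, reward, cost


def high_level_policy(pos, goal):
    """Placeholder for the learned RL policy: always proposes the goal as subgoal."""
    return goal


def run_episode(n_steps=10):
    pos = np.array([0.0, 0.0])
    goal = np.array([2.0, 0.0])
    hazards = np.array([[1.0, 0.05]])  # a single hazard near the straight-line path
    radius = 0.3
    total_reward, total_cost = 0.0, 0
    for _ in range(n_steps):
        subgoal = high_level_policy(pos, goal)
        pos, reward, cost = semdp_step(pos, subgoal, goal, hazards, radius)
        total_reward += reward
        total_cost += cost
    return pos, total_reward, total_cost
```

In this toy setting the high-level policy sees only reward, while constraint handling lives inside `semdp_step`, which is the decoupling the abstract describes. The projection keeps waypoints outside a single hazard; with overlapping hazards it could push a waypoint from one disc into another, so it should not be read as a general safety guarantee.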

Paper Structure

This paper contains 44 sections, 8 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Compared to previous methods, in which the RL agent optimizes the reward and safety constraints simultaneously (left), our method operates in a modified MDP (right). The modified MDP is embedded with a trajectory optimizer to ensure constraint satisfaction. The RL agent outputs a subgoal for the safe trajectory optimizer and hence the RL agent only needs to optimize explicitly for the reward, leading to much better performance with fewer safety violations.
  • Figure 2: An illustration of our method: in the Safety Gym Push task, the objective of the agent (red) is to push the box (yellow) to a goal (green) while avoiding obstacles (purple). Our method embeds safety constraints into the low-level trajectory optimizer to generate a safe trajectory (the dark green dots) leading toward the subgoal. The high-level RL policy outputs a subgoal (the red flag). The RL policy continually updates the subgoal output to achieve the task.
  • Figure 3: Training curves of our method compared to the baseline methods. The shaded region denotes the standard error across seeds. The curve for our method starts at $10^6$ steps instead of $0$ to account for the training of the goal-reaching policy. In these experiments, the cost is defined as the total number of time steps within an episode during which the agent violates the safety constraints. Our method achieves a lower cost than the baselines. It still incurs some cost during training because we use a fixed Lagrangian parameter at training time, both for computational reasons and to encourage exploration.
  • Figure 4: We set up a real-robot environment similar to the Safety Gym Push task. The fingertip of the Franka robot (pink) is used to push the box (black) toward the goal (green). It needs to avoid hazards (red) and avoid getting stuck at the pillar (blue). Each row shows four frames of a single episode. We compare our method with TRPO Lagrangian, which has the best performance among the baselines based on the simulation experiments.
  • Figure 5: Additional training curves of our method compared to the baseline methods. The shaded region denotes the standard error across seeds. The curve for our method starts at $10^6$ steps instead of $0$ to account for the training of the goal-reaching policy. Our method achieves a lower cost than the baselines. It still incurs some cost during training because we use a fixed Lagrangian parameter at training time, both for computational reasons and to encourage exploration.
  • ...and 6 more figures
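
The cost metric used in the Figure 3 caption (the number of time steps in an episode during which the agent violates the safety constraints) can be sketched as follows. This is a minimal illustration under stated assumptions: a violation is taken to mean that the agent's 2D position lies inside a circular hazard, and the function name `episode_cost` and its arguments are hypothetical, not taken from the paper or from Safety Gym.

```python
import numpy as np


def episode_cost(positions, hazard_centers, hazard_radius):
    """Count the time steps at which the agent is inside at least one hazard.

    positions:       (T, 2) agent positions over one episode
    hazard_centers:  (H, 2) centers of circular hazards (assumed shape)
    hazard_radius:   common radius of all hazards (assumed)
    """
    positions = np.asarray(positions, dtype=float)
    centers = np.asarray(hazard_centers, dtype=float)
    # Pairwise distances between every time step and every hazard: shape (T, H).
    dists = np.linalg.norm(positions[:, None, :] - centers[None, :, :], axis=-1)
    # A time step counts once, even if the agent overlaps several hazards.
    violated = (dists < hazard_radius).any(axis=1)
    return int(violated.sum())


trajectory = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.1, 0.0], [2.0, 0.0]]
cost = episode_cost(trajectory, [[1.0, 0.0]], 0.2)  # steps at x=1.0 and x=1.1 violate
```

Counting time steps rather than distinct hazard entries means that lingering inside a hazard is penalized more than briefly clipping it.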