Safe Reinforcement Learning in a Simulated Robotic Arm

Luka Kovač; Igor Farkaš

TL;DR

The paper addresses safe reinforcement learning for robotic manipulation by extending Safety Gym with a PyBullet-based Panda arm to evaluate safety-constrained policies. It adopts a constrained RL formulation in which the objective $J_r(\pi)$ is maximized subject to $J_{c_i}(\pi) \le d_i$ using a Lagrangian relaxation, implemented through PPO and its constrained variant cPPO. Pilot experiments indicate that cPPO achieves comparable reward while reducing safety costs, albeit with longer training times. The work provides an extensible software setup for benchmarking safe RL in robotic arms and highlights potential applications in human-robot interaction.
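To make the Lagrangian relaxation concrete: for a single constraint, a common way to write it is $L(\pi, \lambda) = J_r(\pi) - \lambda \, (J_c(\pi) - d)$ with $\lambda \ge 0$, where the policy maximizes $L$ and the multiplier $\lambda$ is increased whenever the measured cost exceeds the limit $d$. The sketch below illustrates only that multiplier update and the penalized objective. It is a generic illustration and not the authors' or Safety Gym's implementation; the function names, the learning rate, and the numbers in the loop are hypothetical.

```python
def update_multiplier(lam, avg_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the Lagrange multiplier.

    The multiplier grows while the average episode cost exceeds the limit,
    shrinks while the cost is below it, and is clipped at zero.
    """
    return max(0.0, lam + lr * (avg_cost - cost_limit))


def penalized_objective(avg_reward, avg_cost, cost_limit, lam):
    """Lagrangian value that a policy update would try to increase."""
    return avg_reward - lam * (avg_cost - cost_limit)


if __name__ == "__main__":
    # Hypothetical per-epoch statistics, used only to show the dynamics.
    lam, cost_limit = 0.0, 25.0
    for avg_reward, avg_cost in [(5.0, 60.0), (8.0, 40.0), (9.0, 20.0)]:
        lam = update_multiplier(lam, avg_cost, cost_limit)
        value = penalized_objective(avg_reward, avg_cost, cost_limit, lam)
        print(f"lambda={lam:.2f}  penalized objective={value:.2f}")
```

In a full training loop, a step like `update_multiplier` would run once per epoch alongside the usual PPO policy update, which is one plausible reason for the longer training time of the constrained variant.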

Abstract

Reinforcement learning (RL) agents need to explore their environments in order to learn optimal policies. In many environments and tasks, safety is of critical importance. The widespread use of simulators offers a number of advantages, including safe exploration, which becomes unavoidable when RL systems need to be trained directly in the physical environment (e.g. in human-robot interaction). The popular Safety Gym library offers three mobile agent types that can learn goal-directed tasks while considering various safety constraints. In this paper, we extend the applicability of safe RL algorithms by creating a customized environment with a Panda robotic arm in which Safety Gym algorithms can be tested. We performed pilot experiments with the popular PPO algorithm, comparing the baseline with its constrained version, and show that the constrained version learns an equally good policy while complying better with the safety constraints, at the cost of a longer training time, as expected.

Paper Structure (4 sections, 2 equations, 2 figures, 1 table)

Introduction
Finding a technical solution
Experiments
Conclusion

Figures (2)

Figure 1: The Panda arm learned to reach the target (yellow cube) without colliding with an obstacle (red) in front of it.
Figure 2: Comparison of PPO and cPPO on the Panda arm in terms of reward (left) and cost (right). Constrained PPO learns more slowly and takes longer to reach the reward. On the other hand, it keeps the cost at lower values, making the arm's behavior safer.
