Meta SAC-Lag: Towards Deployable Safe Reinforcement Learning via MetaGradient-based Hyperparameter Tuning

Homayoun Honari, Amir Mehdi Soufi Enayati, Mehran Ghafarian Tamizi, Homayoun Najjaran

TL;DR

This work tackles the deployability hurdle of safe reinforcement learning by eliminating manual tuning of safety thresholds in Lagrangian-based methods. It introduces Meta SAC-Lag, a model-free architecture that leverages meta-gradient optimization to automatically adjust the safety threshold $\varepsilon$ and the entropy temperature $\alpha$ within a SAC-Lagrangian framework. The approach combines inner updates of policy and Lagrange multipliers with outer meta-updates driven by two objectives $\mathcal{J}_\varepsilon$ and $\mathcal{J}_\alpha$, and validates performance across five simulated robotic tasks and a real Kinova Gen3 pour-coffee task. Results show improved safety-performance trade-offs and reduced hyperparameter tuning requirements, demonstrating practical deployability of safe RL in real-world settings.
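For orientation, the constrained problem behind Lagrangian-based safe RL is commonly written as below. This is the standard textbook form with a generic one-step meta-update appended, not a transcription of the paper's equations. The symbols $J_r$, $J_c$, $\theta$, $\lambda$, and the step sizes $\eta$ are notation assumed here, and the meta-step is written as descent purely for illustration. The concrete definitions of $\mathcal{J}_\varepsilon$ and $\mathcal{J}_\alpha$ are not given in this summary.

```latex
\begin{aligned}
&\max_{\theta}\; J_r(\pi_\theta) \quad \text{s.t.} \quad J_c(\pi_\theta) \le \varepsilon
  && \text{(constrained problem)} \\
&\mathcal{L}(\theta, \lambda) = J_r(\pi_\theta) - \lambda \bigl( J_c(\pi_\theta) - \varepsilon \bigr),
  \quad \lambda \ge 0
  && \text{(Lagrangian)} \\
&\lambda' = \max\!\bigl(0,\; \lambda + \eta_\lambda \bigl( J_c(\pi_\theta) - \varepsilon \bigr)\bigr),
  \quad \theta' = \theta + \eta_\theta \nabla_\theta \mathcal{L}(\theta, \lambda')
  && \text{(inner updates)} \\
&\varepsilon \leftarrow \varepsilon - \eta_\varepsilon \,
  \frac{\partial \mathcal{J}_\varepsilon\bigl(\theta'(\varepsilon)\bigr)}{\partial \varepsilon}
  && \text{(outer meta-update, assumed form)}
\end{aligned}
```

The key point is that $\theta'$ depends on $\varepsilon$ through the inner updates, so a meta-objective evaluated at $\theta'$ can be differentiated with respect to $\varepsilon$.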

Abstract

Safe Reinforcement Learning (Safe RL) is one of the most widely studied subcategories of trial-and-error-based methods intended for deployment on real-world systems. In safe RL, the goal is to maximize reward performance while minimizing constraints, often achieved by setting bounds on constraint functions and applying the Lagrangian method. However, deploying Lagrangian-based safe RL in real-world scenarios is challenging because the thresholds must be fine-tuned, and imprecise adjustments may lead to suboptimal policy convergence. To mitigate this challenge, we propose a unified Lagrangian-based model-free architecture called Meta Soft Actor-Critic Lagrangian (Meta SAC-Lag). Meta SAC-Lag uses meta-gradient optimization to automatically update the safety-related hyperparameters. The proposed method is designed to address safe exploration and threshold adjustment with minimal hyperparameter tuning. In our pipeline, the inner parameters are updated through the conventional formulation, and the hyperparameters are adjusted using meta-objectives that are defined on the updated parameters. Our results show that the agent can reliably adjust the safety performance owing to the relatively fast convergence of the safety threshold. We evaluate Meta SAC-Lag in five simulated environments against Lagrangian baselines, and the results demonstrate its capability to create synergy between parameters, yielding better or competitive results. Furthermore, we conduct a real-world experiment in which a robotic arm is tasked with pouring coffee into a cup without spillage. Meta SAC-Lag is successfully trained to execute the task while minimizing effort constraints.
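The inner/outer split described in the abstract can be illustrated on a scalar toy problem. The sketch below is not the paper's algorithm: the quadratic `reward` and `cost` functions, the learning rates, and the `meta_objective` (reward minus a weighted cost) are all hypothetical choices made for illustration. It only shows how a threshold `eps` can receive a gradient through one inner update of the multiplier and the policy parameter.

```python
# Toy meta-gradient sketch (all functions and constants are illustrative assumptions).

def reward(theta):
    # Hypothetical reward, maximized at theta = 2.
    return -(theta - 2.0) ** 2

def cost(theta):
    # Hypothetical safety cost, growing with |theta|.
    return theta ** 2

def inner_update(theta, lam, eps, lr_theta=0.05, lr_lam=0.1):
    """One conventional Lagrangian step: dual ascent on lam, then ascent on theta."""
    lam_new = max(0.0, lam + lr_lam * (cost(theta) - eps))
    # Gradient of L(theta, lam_new) = reward - lam_new * (cost - eps) w.r.t. theta.
    grad_theta = -2.0 * (theta - 2.0) - 2.0 * lam_new * theta
    theta_new = theta + lr_theta * grad_theta
    return theta_new, lam_new

def meta_objective(theta_new, w=1.0):
    # Hypothetical meta-objective evaluated at the UPDATED parameter.
    return reward(theta_new) - w * cost(theta_new)

def meta_gradient_eps(theta, lam, eps, lr_theta=0.05, lr_lam=0.1, w=1.0):
    """d meta_objective(theta_new) / d eps via the chain rule through the inner step."""
    theta_new, lam_new = inner_update(theta, lam, eps, lr_theta, lr_lam)
    d_meta_d_theta_new = -2.0 * (theta_new - 2.0) - 2.0 * w * theta_new
    d_theta_new_d_lam = lr_theta * (-2.0 * theta)
    # The projection max(0, .) blocks the gradient when the multiplier is clipped.
    d_lam_d_eps = -lr_lam if lam_new > 0.0 else 0.0
    return d_meta_d_theta_new * d_theta_new_d_lam * d_lam_d_eps

if __name__ == "__main__":
    theta, lam, eps = 1.5, 0.5, 1.0
    for step in range(5):
        grad_eps = meta_gradient_eps(theta, lam, eps)
        theta, lam = inner_update(theta, lam, eps)   # inner update
        eps = eps + 0.5 * grad_eps                   # outer (meta) update, ascent here
        print(step, round(theta, 4), round(lam, 4), round(eps, 4))
```

In practice an automatic-differentiation framework would compute this chain rule over neural-network parameters; the hand-derived version is used here only to keep the dependency of the updated parameter on `eps` visible.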

Paper Structure

This paper contains 16 sections, 28 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Safety-critical environments used to deploy Meta SAC-Lag. The top two rows show simulated environments covering four general safety topics: locomotion (a), obstacle avoidance (b, c), robotic manipulation (d), and dexterous manipulation (e). The bottom row shows the Pour Coffee environment (f, g) used to study the deployability of the algorithm in a real-world setup.
  • Figure 2: The computational graph of Meta SAC-Lag.
  • Figure 3: Performance of Meta SAC-Lag compared with the baseline algorithms. Top row: reward performance during the learning process (higher values are better). Middle row: the value of the exploration hyperparameter ($\alpha$). Bottom row: episodic policy safety performance of the algorithms during the learning process (lower values are better). The dashed lines illustrate the constraint threshold value ($\varepsilon$).
  • Figure 4: Deployment results of Meta SAC-Lag on the real-world setup. (a) and (b) represent the jerk and acceleration of the end effector during the training process. (c) shows the final success rate of the algorithms.