Safe Deep Policy Adaptation

Wenli Xiao, Tairan He, John Dolan, Guanya Shi

TL;DR

SafeDPA generalizes to unseen disturbances in real-world experiments, achieving a 300% increase in safety rate compared to the baselines, and is shown to be robust against learning errors and extra perturbations.

Abstract

A critical goal of autonomy and artificial intelligence is enabling autonomous robots to rapidly adapt in dynamic and uncertain environments. Classic adaptive control and safe control provide stability and safety guarantees but are limited to specific system classes. In contrast, policy adaptation based on reinforcement learning (RL) offers versatility and generalizability but presents safety and robustness challenges. We propose SafeDPA, a novel RL and control framework that simultaneously tackles the problems of policy adaptation and safe reinforcement learning. SafeDPA jointly learns adaptive policy and dynamics models in simulation, predicts environment configurations, and fine-tunes dynamics models with few-shot real-world data. A safety filter based on the Control Barrier Function (CBF) on top of the RL policy is introduced to ensure safety during real-world deployment. We provide theoretical safety guarantees of SafeDPA and show the robustness of SafeDPA against learning errors and extra perturbations. Comprehensive experiments on (1) classic control problems (Inverted Pendulum), (2) simulation benchmarks (Safety Gym), and (3) a real-world agile robotics platform (RC Car) demonstrate great superiority of SafeDPA in both safety and task performance, over state-of-the-art baselines. Particularly, SafeDPA demonstrates notable generalizability, achieving a 300% increase in safety rate compared to the baselines, under unseen disturbances in real-world experiments.

Paper Structure (24 sections, 1 theorem, 5 equations, 7 figures, 1 table)

This paper contains 24 sections, 1 theorem, 5 equations, 7 figures, 1 table.

Key Result

Theorem 1

Under the assumptions asm:error and asm:continuity, solving the safety condition $p^T f\left(x_t\right)+p^T g\left(x_t\right) a_t+p^T q \geq (1-\eta) h\left(x_t\right)-\epsilon$ in eq:cbf-qp guarantees the forward invariance of the safe set $\mathcal{C}$ (i.e., $x_{t+n} \in \mathcal{C}$ for all $n=1,2,3,\cdots$).
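
The safety condition in Theorem 1 is affine in the action $a_t$, so a filter that changes the RL action as little as possible can be sketched in closed form. The snippet below is a minimal illustration, not the paper's implementation: the minimum-deviation objective $\|a - a^{RL}\|^2$ is a common CBF-QP choice assumed here (eq:cbf-qp is not reproduced on this page), the function name `cbf_filter` and all argument names are hypothetical, and the term $p^T q$ is passed in as a plain scalar because $q$ is not defined in this excerpt.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def cbf_filter(a_rl, p, f_x, g_x, pq_term, h_x, eta, eps):
    """Return the action closest to a_rl (Euclidean norm) that satisfies
        p^T f(x) + p^T g(x) a + p^T q >= (1 - eta) * h(x) - eps.

    a_rl    : RL action, list of length m
    p       : vector p, list of length n
    f_x     : f(x_t), list of length n
    g_x     : g(x_t), n-by-m nested list
    pq_term : the scalar p^T q from the theorem (assumed given)
    h_x     : h(x_t), eta, eps : scalars from the theorem
    """
    m = len(a_rl)
    # Rewrite the condition as c^T a >= b with c = g(x)^T p.
    c = [sum(p[i] * g_x[i][j] for i in range(len(p))) for j in range(m)]
    b = (1.0 - eta) * h_x - eps - dot(p, f_x) - pq_term

    violation = b - dot(c, a_rl)
    if violation <= 0.0:
        # The RL action already satisfies the condition; leave it unchanged.
        return list(a_rl)

    c_norm_sq = dot(c, c)
    if c_norm_sq == 0.0:
        # The action has no influence on the condition, so no action can fix it.
        raise ValueError("safety condition is infeasible for this state")

    # Project a_rl onto the half-space {a : c^T a >= b}.
    scale = violation / c_norm_sq
    return [ai + scale * ci for ai, ci in zip(a_rl, c)]
```

With a single affine constraint, the projection above is the exact solution of the assumed QP; with several constraints or action bounds, a general QP solver would be needed instead.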

Figures (7)

  • Figure 1: Overview of the four phases of SafeDPA. In Phase 1 (a), the environment encoder $\mu_{\theta_\mu}$ and the dynamics models $f_{\theta_f}$, $g_{\theta_g}$ are jointly trained on an offline dataset collected by a random policy in simulation. In Phase 1 (b), we freeze the parameters of the environment encoder $\mu_{\theta_\mu}$ and train the base policy in simulation using model-free RL. In Phase 2, we train the adaptation module $\phi_{\theta}(x_{t-k:t-1}, a_{t-k:t-1})$ to match the environment encoder $\mu_{\theta_\mu}$ from the history of states and actions, using on-policy data. In Phase 3, we fine-tune the dynamics model learned in simulation with few-shot real-world data. In Phase 4, we use the learned adaptive dynamics to construct a CBF-based safety filter on top of the adaptive RL policy to ensure safety during real-world deployment.
  • Figure 2: Comparison of SafeDPA, SafeDPA without fine-tuning, and RMA with penalty on RC Car platforms. We showcase successful trajectories of SafeDPA in four tasks, alongside instances of unsafe events (i.e., collisions) for SafeDPA without fine-tuning and RMA with penalty. SafeDPA safely reaches the goal even though it never sees the box or chairs during training or fine-tuning, which highlights its generalizability and adaptability. Video demonstrations are available at https://sites.google.com/view/safe-deep-policy-adaptation.
  • Figure 3: The area of the colored region represents the safety rate. In the inverted pendulum task, SafeDPA consistently achieves the highest safety rate with zero violations across all directions, surpassing $\text{Fix-}\alpha$ and Mix, which maintain high safety rates only in specific directions.
  • Figure 4: Left: the InvPendulum-hazard environment. Right: success rate and safety rate of SafeDPA and the baselines; SafeDPA is the only algorithm that achieves 100% in both success rate and safety rate.
  • Figure 5: The radius of the circles indicates task performance (success rate), while the size of the colored area represents the safety rate. SafeDPA stands out with the highest safety rate of $97.5\%$. Although PPO and TRPO have marginally higher success rates, they frequently violate safety constraints.
  • ...and 2 more figures
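
The adaptation module in the Figure 1 caption takes a sliding window of the last $k$ states and actions, $(x_{t-k:t-1}, a_{t-k:t-1})$. A minimal sketch of how such a window might be maintained at deployment time is given below. The class name `HistoryBuffer`, its methods, and the interleaved state-action layout of the flattened vector are illustrative assumptions and are not taken from the paper.

```python
from collections import deque


class HistoryBuffer:
    """Keeps the k most recent (state, action) pairs (hypothetical helper)."""

    def __init__(self, k):
        self.k = k
        # deque(maxlen=k) silently discards the oldest entry once full.
        self.states = deque(maxlen=k)
        self.actions = deque(maxlen=k)

    def push(self, x, a):
        """Record the state x and action a of the step just taken."""
        self.states.append(list(x))
        self.actions.append(list(a))

    def ready(self):
        """True once a full window of k steps is available."""
        return len(self.states) == self.k

    def window(self):
        """Flatten the window as [x_{t-k}, a_{t-k}, ..., x_{t-1}, a_{t-1}]."""
        if not self.ready():
            raise ValueError("fewer than k steps recorded so far")
        flat = []
        for x, a in zip(self.states, self.actions):
            flat.extend(x)
            flat.extend(a)
        return flat
```

The flattened vector would then be fed to whatever network plays the role of $\phi_{\theta}$; until `ready()` returns true, a caller would need some fallback, such as padding, which is left out of this sketch.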

Theorems & Definitions (2)

  • Theorem 1: Safe Control
  • proof