Learning Robust Policies via Interpretable Hamilton-Jacobi Reachability-Guided Disturbances

Hanyang Hu, Xilun Zhang, Xubo Lyu, Mo Chen

TL;DR

This paper proposes a robust policy training framework that integrates model-based control principles with adversarial RL training to improve robustness without the need for external black-box adversaries, and introduces a novel Hamilton-Jacobi reachability-guided disturbance for adversarial RL training.

Abstract

Deep Reinforcement Learning (RL) has shown remarkable success in robotics with complex and heterogeneous dynamics. However, its vulnerability to unknown disturbances and adversarial attacks remains a significant challenge. In this paper, we propose a robust policy training framework that integrates model-based control principles with adversarial RL training to improve robustness without the need for external black-box adversaries. Our approach introduces a novel Hamilton-Jacobi (HJ) reachability-guided disturbance for adversarial RL training, in which interpretable worst-case or near-worst-case disturbances act as adversaries against the robust policy. We evaluate its effectiveness across three distinct tasks: a reach-avoid game in both simulation and real-world settings, and a highly dynamic quadrotor stabilization task in simulation. We validate that the learned critic network is consistent with the ground-truth HJ value function, while the policy network performs comparably to other learning-based methods.

Paper Structure (13 sections, 12 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: HJARL computes the HJ value functions offline and uses them to generate adversarial disturbances during online training. The trained robust policy is then deployed to handle various disturbances.
  • Figure 2: Heatmaps of the trained critic networks and the zero-level set of $\mathcal{RA}^{11}_{\infty}(R^{11}, A^{11})$ (purple dashed lines) with SIG dynamics. The first and second rows show the values for defender initial positions $[0.5,0.0]$ and $[-0.5, -0.5]$, respectively (magenta stars).
  • Figure 3: Game performance of the trained policies and the zero-level set of $\mathcal{RA}^{11}_{\infty}(R^{11}, A^{11})$ (purple dashed lines) with SIG dynamics. Initial attacker positions are generated uniformly across the map at intervals of 0.05 grid units, with the defender’s initial position fixed. The first and second rows show the game results for defender initial positions $[0.5,0.0]$ and $[-0.5, -0.5]$, respectively (magenta stars).
  • Figure 4: The real-world one-vs-one reach-avoid game with two TurtleBot3 Burger robots.
  • Figure 5: Heatmaps of the trained critic networks and the zero-level HJ BRT (purple dashed lines) with the DubinCar model. The defender starts at $[0.7, -0.4, -0.5]$ (magenta square), with the arrow pointing in its direction.