Solving Reach-Avoid-Stay Problems Using Deep Deterministic Policy Gradients

Gabriel Chenevert, Jingqi Li, Achyuta Kannan, Sangjae Bae, Donggun Lee

TL;DR

This paper proposes a two-step deep deterministic policy gradient (DDPG) method that extends RL-based reachability methods to reach-avoid-stay (RAS) problems, and proves that the method recovers the maximal robust RAS set in the absence of training errors.

Abstract

Reach-Avoid-Stay (RAS) optimal control enables systems such as robots and air taxis to reach their targets, avoid obstacles, and stay near the target. However, current methods for RAS often struggle to handle complex, dynamic environments and to scale to high-dimensional systems. While reinforcement learning (RL)-based reachability analysis addresses these challenges, it has yet to tackle the RAS problem. In this paper, we propose a two-step deep deterministic policy gradient (DDPG) method that extends RL-based reachability methods to solve RAS problems. First, we train a function that characterizes the maximal robust control invariant set within the target set, where the system can safely stay, along with its corresponding policy. Second, we train a function that defines the set of states capable of safely reaching the robust control invariant set, along with its corresponding policy. We prove that this method results in the maximal robust RAS set in the absence of training errors and demonstrate that it enables RAS in complex environments, scales to high-dimensional systems, and achieves higher success rates for the RAS task than previous methods, as validated through one simulation and two high-dimensional experiments.
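
The two-step structure described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's DDPG method: it swaps the learned value functions for exact set iteration on a small integer grid, which is only feasible because the state space is tiny. The dynamics `x + u + d`, the grid, the target, the obstacle, and all function names are assumptions made for illustration.

```python
# Illustrative only: exact set iteration on a toy 1D grid, standing in for
# the learned value functions. All names and numbers here are hypothetical.

def robust_pre(candidates, goal, controls, disturbances):
    """States in `candidates` with a control that lands in `goal`
    for every disturbance (toy dynamics: x_next = x + u + d)."""
    return {
        x for x in candidates
        if any(all(x + u + d in goal for d in disturbances) for u in controls)
    }

def max_invariant_set(target, controls, disturbances):
    """Step 1: shrink the target until every remaining state can stay inside."""
    kernel = set(target)
    while True:
        shrunk = robust_pre(kernel, kernel, controls, disturbances)
        if shrunk == kernel:
            return kernel
        kernel = shrunk

def reach_avoid_set(goal, safe_states, controls, disturbances):
    """Step 2: grow the set of safe states that can be steered into `goal`."""
    reach = set(goal)
    while True:
        grown = reach | robust_pre(safe_states, reach, controls, disturbances)
        if grown == reach:
            return reach
        reach = grown

CONTROLS = (-2, -1, 0, 1, 2)       # control authority
DISTURBANCES = (-1, 0, 1)          # adversarial disturbance
TARGET = {3, 4, 6, 7, 8}           # target set with a gap at 5
SAFE = set(range(0, 11)) - {1}     # grid 0..10 with an obstacle at 1

kernel = max_invariant_set(TARGET, CONTROLS, DISTURBANCES)
ras = reach_avoid_set(kernel, SAFE, CONTROLS, DISTURBANCES)
print(sorted(kernel))  # [6, 7, 8]
print(sorted(ras))     # [2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In this toy instance, states 3 and 4 lie in the target but not in the invariant set, because no control keeps every disturbed successor inside the target. They still belong to the RAS set, since they can be steered into `{6, 7, 8}`. State 0 is excluded because every available control risks landing on the obstacle.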

Paper Structure

This paper contains 12 sections, 2 theorems, 34 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

The super-zero level set of $V$ is the maximal robust RAS set $\mathcal{RAS}$. Thus, $V(x) \leq 0$ if and only if $x \notin \mathcal{RAS}$.

Figures (6)

  • Figure 1: The reach-avoid-stay problem can be described as controlling a system from an initial state, around all obstacles, to a target set where it remains. The reach-avoid-stay set (blue region) is the set of states from which our two-step process can achieve the reach-avoid-stay task. First, we apply a reach-avoid policy $\pi_V$ to move the system from its initial state, represented by the black $X$, to the control-invariant set (green) within the target set (light green). Second, we use the invariant policy $\pi_H$ to keep the system within the target set indefinitely. Note that the functions $g$ and $l$ characterize the target set and the obstacle, respectively.
  • Figure 2: This figure illustrates the maximal RAS set $\mathcal{RAS}$ (black) and a state trajectory (green and blue) that successfully achieves the RAS task under adversarial disturbances, with an initial state of $[4.5; 0]$. $\pi_V$ is applied outside the maximal viability kernel $\mathcal{AS}$ (red) and drives the green trajectory, while $\pi_H$ is applied inside it and drives the blue trajectory.
  • Figure 3: This figure illustrates two RAS sets corresponding to the target set $T$ and obstacle set $C$, one computed with our framework and one with a baseline CLBF method (Meng et al., 2022), as well as the color map of $V$. The maximal RAS set $\mathcal{RAS}$ (red) computed by our method is a superset of the RAS set provided by the baseline CLBF method (dashed magenta).
  • Figure 4: Visualization of the learned RAS value function, the learned RA value function, and simulations of their policies for the VTOL example. We parameterize both the RA and RAS value functions and their associated policies using 4-layer ReLU neural networks, each with 512 neurons per layer. A: The learned value functions, where the x and y positions of the ego drone are varied while the remaining 10 state dimensions are fixed. B: The learned RAS set (the super-zero level set of the RAS value function) and the learned RA set (the super-zero level set of the RA value function). The red dot represents drone 1, the green dot represents drone 2, and the grey dot represents the static cylinder obstacle. The deep red areas near these objects indicate that both value functions accurately capture the safety information around them. C: Simulation trajectories for both the RAS and RA policies. Under the RAS policy, the trajectory safely reaches and stays within the target set. Under the RA policy, by contrast, the trajectory reaches the target set safely but at high speed, and therefore leaves it after entering.
  • Figure 5: VTOL Demonstration. The orange line represents the ego drone. The red and teal lines represent obstacle drones 1 and 2 respectively.
  • ...and 1 more figure
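
The two-policy execution described in the Figure 1 caption (reach with $\pi_V$, then stay with $\pi_H$) can be sketched as a simple switching rule. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that the super-zero level set of a learned function $H$ marks the control-invariant set, by analogy with Theorem 1 for $V$, and the name `ras_control` and every `toy_*` stand-in below are hypothetical placeholders for trained networks.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Control = List[float]

def ras_control(
    x: State,
    V: Callable[[State], float],
    H: Callable[[State], float],
    pi_V: Callable[[State], Control],
    pi_H: Callable[[State], Control],
) -> Control:
    """Choose a control for state x from two value functions and their policies.

    Assumption: H(x) > 0 marks the control-invariant set inside the target,
    and V(x) > 0 marks the RAS set (the latter follows Theorem 1).
    """
    if H(x) > 0.0:
        # Already inside the invariant set: use the stay policy.
        return pi_H(x)
    if V(x) > 0.0:
        # Inside the RAS set but not yet invariant: use the reach-avoid policy.
        return pi_V(x)
    raise ValueError("state lies outside the RAS set; no guarantee applies")

# Hypothetical 1D stand-ins for the learned functions and policies.
def toy_H(x: State) -> float:
    return 1.0 - abs(x[0])              # invariant set: |x| < 1

def toy_V(x: State) -> float:
    return 5.0 - abs(x[0])              # RAS set: |x| < 5

def toy_pi_H(x: State) -> Control:
    return [-x[0]]                      # pull back toward the origin

def toy_pi_V(x: State) -> Control:
    return [-1.0 if x[0] > 0 else 1.0]  # step toward the invariant set

print(ras_control([0.5], toy_V, toy_H, toy_pi_V, toy_pi_H))  # [-0.5]
print(ras_control([4.5], toy_V, toy_H, toy_pi_V, toy_pi_H))  # [-1.0]
```

Checking $H$ before $V$ reflects the ordering in the caption: once the state enters the invariant set, the stay policy takes over and the reach-avoid policy is no longer queried.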

Theorems & Definitions (5)

  • Remark 1
  • Remark 2
  • Theorem 1
  • Proof
  • Corollary 1