Solving Minimum-Cost Reach Avoid using Reinforcement Learning

Oswin So, Cheng Ge, Chuchu Fan

TL;DR

This work proposes RC-PPO, a reinforcement-learning-based method that solves the minimum-cost reach-avoid problem by using connections to Hamilton-Jacobi reachability. RC-PPO learns policies with goal-reaching rates comparable to existing methods while achieving up to 57% lower cumulative costs.

Abstract

Current reinforcement-learning methods are unable to directly learn policies that solve the minimum-cost reach-avoid problem, which is to minimize cumulative costs subject to the constraints of reaching the goal and avoiding unsafe states, because the structure of this new optimization problem is incompatible with current methods. Instead, a surrogate problem is solved in which all objectives are combined with a weighted sum. However, this surrogate objective results in suboptimal policies that do not directly minimize the cumulative cost. In this work, we propose RC-PPO, a reinforcement-learning-based method for solving the minimum-cost reach-avoid problem by using connections to Hamilton-Jacobi reachability. Empirical results demonstrate that RC-PPO learns policies with goal-reaching rates comparable to existing methods while achieving up to 57% lower cumulative costs on a suite of minimum-cost reach-avoid benchmarks on the MuJoCo simulator. The project page can be found at https://oswinso.xyz/rcppo.
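
To make the contrast in the abstract concrete, the constrained problem and a weighted-sum surrogate can be written side by side. The notation below is assumed for illustration and is not taken from the source: dynamics $x_{t+1} = f(x_t, u_t)$, stage cost $c$, goal set $\mathcal{G}$, unsafe set $\mathcal{F}$, and hypothetical penalty weights $\lambda_g, \lambda_f \geq 0$.

```latex
% Minimum-cost reach-avoid problem (constrained form, assumed notation)
\begin{aligned}
\min_{\pi,\, T} \quad & \sum_{t=0}^{T-1} c\big(x_t, \pi(x_t)\big) \\
\text{s.t.} \quad & x_T \in \mathcal{G}, \\
& x_t \notin \mathcal{F} \quad \forall t \in \{0, \dots, T\}, \\
& x_{t+1} = f\big(x_t, \pi(x_t)\big).
\end{aligned}

% One possible weighted-sum surrogate (weights are hypothetical)
\min_{\pi} \; \sum_{t} \Big[ c\big(x_t, \pi(x_t)\big)
    - \lambda_g \, \mathbf{1}\{x_t \in \mathcal{G}\}
    + \lambda_f \, \mathbf{1}\{x_t \in \mathcal{F}\} \Big]
```

In the surrogate, the minimizer depends on the chosen weights, so it need not coincide with the cheapest policy among those that reach the goal safely.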

Paper Structure

This paper contains 52 sections, 4 theorems, 69 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

For given initial conditions $x_0\in \mathcal{X}$, $z_0 \in \mathbb{R}$ and control policy $\pi$, consider the trajectory for the original system $\{x_0, \dots x_T\}$ and its corresponding trajectory for the augmented system $\{(x_0, y_0, z_0), \dots (x_T, y_T, z_T)\}$ for some $T>0$. Then, the reac
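
A minimal sketch of the kind of augmented trajectory that Theorem 1 refers to, under stated assumptions. The augmented dynamics are not given in this excerpt, so the update rules below are assumptions for illustration: `y` is taken to be a flag that latches once an unsafe state is visited, and `z` a remaining cost budget that decreases by the stage cost at each step. The functions `step`, `cost`, `is_unsafe`, and the toy 1-D system are hypothetical.

```python
from typing import Callable, List, Tuple

State = float
AugState = Tuple[State, bool, float]  # (x, y, z)


def rollout_augmented(
    x0: State,
    z0: float,
    policy: Callable[[State, float], float],
    step: Callable[[State, float], State],
    cost: Callable[[State, float], float],
    is_unsafe: Callable[[State], bool],
    horizon: int,
) -> List[AugState]:
    """Roll out (x, y, z) for `horizon` steps under the assumed augmented dynamics."""
    x, y, z = x0, is_unsafe(x0), z0
    traj = [(x, y, z)]
    for _ in range(horizon):
        u = policy(x, z)
        z = z - cost(x, u)      # assumed: budget shrinks by the stage cost
        x = step(x, u)
        y = y or is_unsafe(x)   # assumed: latches True after any unsafe visit
        traj.append((x, y, z))
    return traj


# Hypothetical toy system: integrator on the real line, cost equal to |u|.
traj = rollout_augmented(
    x0=0.0,
    z0=3.0,
    policy=lambda x, z: 1.0,
    step=lambda x, u: x + u,
    cost=lambda x, u: abs(u),
    is_unsafe=lambda x: x < -1.0,
    horizon=3,
)
x_T, y_T, z_T = traj[-1]
# Goal assumed to be x >= 3; success requires safety and a non-negative budget.
reached_within_budget = (x_T >= 3.0) and (not y_T) and (z_T >= 0.0)
```

Under these assumed dynamics, inspecting only the final augmented state tells whether the original trajectory reached the goal, stayed safe throughout, and spent no more than `z0`.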

Figures (9)

  • Figure 1: Summary of the RC-PPO algorithm. In phase one, the original dynamic system is transformed into the augmented dynamic system defined in \ref{equ:hyb_evo}. Then RL is used to optimize the value function $\tilde{V}^{\pi}_{\hat{g}}$ and learn a stochastic policy $\pi$. In phase two, we fine-tune $\tilde{V}^{\pi}_{\hat{g}}$ on a deterministic version of $\pi$ and compute the optimal upper-bound $z^*$ to obtain the optimal deterministic policy $\pi^*$.
  • Figure 2: Illustrations of the benchmark tasks. In each picture, red denotes the unsafe region to be avoided, while green denotes the goal region to be reached.
  • Figure 3: Reach rates under the sparse reward setting. RC-PPO consistently achieves the highest reach rates in all benchmark tasks. Error bars denote the standard error.
  • Figure 4: Cumulative cost (IQM) and reach rates under reward shaping on four selected benchmarks. RC-PPO achieves significantly lower cumulative costs while retaining comparable reach rates even when compared with baseline methods that use reward shaping.
  • Figure 5: Trajectory comparisons. On Pendulum, RC-PPO learns to perform an extensive energy pumping strategy to reach the goal upright position (green line), resulting in vastly lower cumulative energy. On WindField, RC-PPO takes advantage of the wind field instead of fighting against it, resulting in a faster trajectory to the goal region (green box) that uses lower cumulative energy. The start of the trajectory is marked by $\blacksquare$.
  • ...and 4 more figures
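
Phase two in Figure 1 computes an upper bound $z^*$ from the learned value function. One way such a search could be carried out is bisection, sketched below. This is an illustration and not the authors' implementation: it assumes the value function is non-increasing in $z$ and that a value of at most zero means the goal is reachable within budget $z$. The function `value_fn`, the bracket `[z_lo, z_hi]`, and the toy value function are hypothetical.

```python
from typing import Callable


def smallest_feasible_budget(
    value_fn: Callable[[float], float],
    z_lo: float,
    z_hi: float,
    tol: float = 1e-6,
) -> float:
    """Bisection for the smallest z in [z_lo, z_hi] with value_fn(z) <= 0.

    Assumes value_fn is non-increasing in z (an assumption, not from the source).
    """
    if value_fn(z_hi) > 0.0:
        raise ValueError("no feasible budget in the bracket")
    if value_fn(z_lo) <= 0.0:
        return z_lo
    while z_hi - z_lo > tol:
        mid = 0.5 * (z_lo + z_hi)
        if value_fn(mid) <= 0.0:
            z_hi = mid  # mid is feasible, tighten from above
        else:
            z_lo = mid  # mid is infeasible, raise the lower end
    return z_hi  # always a feasible budget


# Hypothetical value function for a fixed initial state: feasible once z >= 2.5.
z_star = smallest_feasible_budget(lambda z: 2.5 - z, z_lo=0.0, z_hi=10.0)
```

Returning the upper end of the bracket keeps the result on the feasible side, so the returned budget errs toward being slightly too large and never too small.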

Theorems & Definitions (12)

  • Theorem 1
  • Remark 1: Connections to the epigraph form in constrained optimization
  • Definition 1: Stochastic Reachability Bellman Equation
  • Definition 2: Reachability Markov Decision Process
  • Theorem 2
  • Theorem 3
  • proof
  • proof
  • proof
  • proof
  • ...and 2 more