Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning

Oswin So, Eric Yang Yu, Songyuan Zhang, Matthew Cleaveland, Mitchell Black, Chuchu Fan

TL;DR

Feasibility-Guided Exploration (FGE) is a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists and learns a policy to solve the reachability problem over this set of initial conditions.

Abstract

Recent advances in deep reinforcement learning (RL) have achieved strong results on high-dimensional control tasks, but applying RL to reachability problems raises a fundamental mismatch: reachability seeks to maximize the set of states from which a system remains safe indefinitely, while RL optimizes expected returns over a user-specified distribution. This mismatch can result in policies that perform poorly on low-probability states that are still within the safe set. A natural alternative is to frame the problem as a robust optimization over a set of initial conditions that specify the initial state, dynamics and safe set, but whether this problem has a solution depends on the feasibility of the specified set, which is unknown a priori. We propose Feasibility-Guided Exploration (FGE), a method that simultaneously identifies a subset of feasible initial conditions under which a safe policy exists, and learns a policy to solve the reachability problem over this set of initial conditions. Empirical results demonstrate that FGE learns policies with over 50% more coverage than the best existing method for challenging initial conditions across tasks in the MuJoCo simulator and the Kinetix simulator with pixel observations.

Paper Structure

This paper contains 51 sections, 8 theorems, 71 equations, 32 figures, 2 tables, 3 algorithms.

Key Result

Proposition 1

If $\mathcal{S}_0 \subseteq \mathcal{S}$, then the parameter-robust avoid problem is feasible.

Figures (32)

  • Figure 1: FGE modifies the parameter sampling distribution. FGE modifies the base distribution over the parameters by introducing an explore distribution to improve policy performance on parameters that have not been observed to be safe so far, and a rehearsal distribution obtained via sampling-based approximate best response to train on poorly performing parameters that were previously solved. We combine all three distributions to obtain the final distribution for FGE that balances the objectives of maximizing safety rate gain and minimizing safety rate loss.
  • Figure 2: Adaptive Cruise Control Example. We illustrate an example using an autonomous driving scenario. The parameters $\theta = [\bar{a}, \Delta v_0]$ model varying max acceleration $\bar{a}$ (e.g., from weather) and initial relative velocity $\Delta v_0$. We want to avoid crashing into the car in front ($\Delta p \leq 0$). (Left) We plot trajectories for different $\bar{a}$ and (Right) visualize the feasible parameter set $\Theta^*$ in green.
  • Figure 3: (Left) We only have positive labels for feasibility, from the set $\mathcal{D}_{\mathfrak{f}}$ of parameters that were measured to be safe. Since we don't have negative labels, standard supervised learning cannot be used. (Center) We address this by defining a target distribution $p_{\text{mix}}$ via mixing positive labels from $p_{\mathcal{D}_{\mathfrak{f}}}(\theta)$ with noisy labels from $p^\pi$, controlled by $\alpha$. (Right) Approximating $p_{\text{mix}}(\mathfrak{f}=1|\theta)$ with $q_\psi(\mathfrak{f}=1 | \theta)$ using rollout samples and $\mathcal{D}_{\mathfrak{f}}$ samples yields a feasibility classifier with no false positives and a false-negative rate controllable via $\alpha$ and $\rho$ (\ref{thm:ci_clsfy_probs}).
  • Figure 4: Gradient Descent Ascent (GDA) vs FGE on a Bilinear Game. On $\max_{\pi \in [-1,1]} \min_{\theta \in [-1, 1]} \pi \theta$ for $\pi$, GDA fails to converge in last-iterate, while FGE using \ref{eq:saddlepoint:approx_ftrl} converges to the saddle point at the origin. Both converge in average iterate (see \ref{app:bilinear_game}). We plot the average $\theta$ for FGE.
  • Figure 5: Feasibility-Guided Exploration (FGE). Starting from an on-policy RL algorithm, we adapt the initial state distribution as a mixture of the base, exploration, and rehearsal distributions. The exploration component expands the feasible set and the rehearsal buffer targets feasible parameters that the current policy underperforms. After each episode, newly discovered feasible parameters $\theta$ are added to the dataset $\mathcal{D}_{\mathfrak{f}}$. A mixture of samples from $\mathcal{D}_{\mathfrak{f}}$ and the episode forms $p_{\text{mix}}(\mathfrak{f}, \theta)$, used to train the feasibility classifier $q_\psi$. The feasibility classifier $q_\psi$ guides rejection sampling for the explore distribution. We also train a policy-conditioned classifier $p^\pi$ to predict $\pi$'s performance given $\theta$, enabling approximate best-response selection of worst-case feasible $\theta$ for rehearsal. We use the ACC example and plot its parameter space in all plots (see \ref{fig:acc}).
  • ...and 27 more figures
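
The three-way sampling scheme described in the captions of Figures 1 and 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `sample_theta`, the mixture weights, the 0.5 acceptance threshold, the candidate count, and the toy 1-D stand-ins for the classifiers are all assumptions made for illustration. In particular, the sketch assumes that the explore component accepts a parameter when the feasibility classifier does not yet label it feasible, and that rehearsal picks the least-safe-looking parameter among a few candidates drawn from the known-feasible dataset.

```python
import numpy as np


def sample_theta(rng, sample_base, q_feasible, p_safe, feasible_set,
                 w_explore=0.3, w_rehearse=0.3, n_candidates=16, max_tries=100):
    """Draw one parameter from a base / explore / rehearsal mixture (illustrative only)."""
    u = rng.random()
    if u < w_rehearse and len(feasible_set) > 0:
        # Rehearsal: sampling-based approximate best response. Draw a few
        # known-feasible candidates and keep the one the policy-conditioned
        # classifier predicts to be least safe.
        idx = rng.integers(len(feasible_set), size=n_candidates)
        candidates = [feasible_set[i] for i in idx]
        return min(candidates, key=p_safe), "rehearsal"
    if u < w_rehearse + w_explore:
        # Explore: rejection sampling from the base distribution.
        # ASSUMPTION: accept parameters not yet classified as feasible.
        for _ in range(max_tries):
            theta = sample_base(rng)
            if q_feasible(theta) < 0.5:
                return theta, "explore"
    # Base distribution (also the fallback if rejection sampling fails).
    return sample_base(rng), "base"


# Toy 1-D stand-ins; every function below is hypothetical.
feasible_set = [0.1, 0.2, 0.3]                         # parameters observed to be safe
sample_base = lambda rng: float(rng.uniform(0.0, 1.0))  # base distribution over theta
q_feasible = lambda theta: 1.0 if theta < 0.4 else 0.0  # stand-in feasibility classifier
p_safe = lambda theta: 1.0 - theta                      # stand-in predicted safety under pi

rng = np.random.default_rng(0)
draws = [sample_theta(rng, sample_base, q_feasible, p_safe, feasible_set)
         for _ in range(2000)]
```

In this toy setup the explore draws all land outside the region the stand-in classifier already marks feasible, while the rehearsal draws come only from `feasible_set`.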

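The GDA behaviour described in the Figure 4 caption can be reproduced with a short Python sketch of projected simultaneous gradient descent ascent on $\max_{\pi} \min_{\theta} \pi\theta$ over $[-1,1]^2$. It covers only the GDA baseline. The FTRL-based update that FGE uses is not specified in this excerpt and is not reproduced here. The step size, starting point, and number of steps are arbitrary choices for illustration.

```python
import numpy as np


def projected_gda(pi0=0.5, theta0=0.5, lr=0.05, steps=20000):
    """Simultaneous projected GDA on f(pi, theta) = pi * theta over [-1, 1]^2."""
    pi, theta = pi0, theta0
    history = np.empty((steps, 2))
    for t in range(steps):
        grad_pi, grad_theta = theta, pi  # df/dpi = theta, df/dtheta = pi
        # pi ascends (maximizer), theta descends (minimizer); both are
        # updated from the same iterate and clipped back into the box.
        pi, theta = (float(np.clip(pi + lr * grad_pi, -1.0, 1.0)),
                     float(np.clip(theta - lr * grad_theta, -1.0, 1.0)))
        history[t] = (pi, theta)
    return history


history = projected_gda()
last_dist = float(np.linalg.norm(history[-1]))           # last iterate vs. saddle point
avg_dist = float(np.linalg.norm(history.mean(axis=0)))   # average iterate vs. saddle point
```

Comparing `last_dist` with `avg_dist` shows the gap the caption refers to: the last iterate keeps cycling away from the saddle point at the origin, while the running average of the iterates settles close to it.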
Theorems & Definitions (17)

  • Remark 1: Comparison to Density Modeling Approaches
  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • ...and 7 more