Safe and Near-Optimal Control with Online Dynamics Learning

Manish Prajapat, Johannes Köhler, Melanie N. Zeilinger, Andreas Krause

TL;DR

This paper introduces the notion of maximum safe dynamics learning, in which sufficient exploration is performed within the space of safe policies, ensuring continuous online learning of the dynamics.

Abstract

Achieving both optimality and safety under unknown system dynamics is a central challenge in real-world deployment of agents. To address this, we introduce a notion of maximum safe dynamics learning, where sufficient exploration is performed within the space of safe policies. Our method executes $\textit{pessimistically}$ safe policies while $\textit{optimistically}$ exploring informative states and, despite not reaching them due to model uncertainty, ensures continuous online learning of dynamics. The framework achieves first-of-its-kind results: learning the dynamics model sufficiently, up to an arbitrarily small tolerance (subject to noise), in finite time, while ensuring provably safe operation throughout with high probability and without requiring resets. Building on this, we propose an algorithm to maximize rewards while learning the dynamics $\textit{only to the extent needed}$ to achieve close-to-optimal performance. Unlike typical reinforcement learning (RL) methods, our approach operates online in a non-episodic setting and ensures safety throughout the learning process. We demonstrate the effectiveness of our approach in challenging domains such as autonomous car racing and drone navigation under aerodynamic effects, scenarios where safety is critical and accurate modeling is difficult.

Paper Structure

This paper contains 19 sections, 23 theorems, 87 equations, 11 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Let Assumption (assump:q_RKHS) hold and define $\sqrt{\beta_{n, i}} \coloneqq B_i + \sqrt{ \ln\det\left(I_{D_n} + \sigma^{-2} K^{i}_{D_n}\right) + 2\ln(n_x/\delta)}$. Then, it holds that $\mathrm{Pr}\left( \bm{f}^{\star}\in\mathcal{F}_{n}, \ \forall n\in \mathbb{N} \right) \geq 1-\delta$.
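
As a rough illustration of how the scaling factor in Lemma 1 could be evaluated numerically, the sketch below computes $\sqrt{\beta_{n,i}}$ from a kernel matrix. The function names `rbf_kernel` and `sqrt_beta`, the choice of a squared-exponential kernel, and all numerical values are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np


def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix (an assumed kernel choice)."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def sqrt_beta(K, B, sigma, n_x, delta):
    """sqrt(beta) = B + sqrt(ln det(I + K / sigma^2) + 2 ln(n_x / delta))."""
    n = K.shape[0]
    # slogdet avoids overflow in the determinant for larger data sets.
    _, logdet = np.linalg.slogdet(np.eye(n) + K / sigma**2)
    return B + np.sqrt(logdet + 2.0 * np.log(n_x / delta))


# Hypothetical data: 20 sampled points in a 2-dimensional input space.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
K = rbf_kernel(X, X)
print(sqrt_beta(K, B=1.0, sigma=0.1, n_x=2, delta=0.05))
```

Under these assumptions the log-determinant term does not decrease as data points are added, so the scaling factor grows slowly with the data set while the confidence level $1-\delta$ stays fixed.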

Figures (11)

  • Figure 1: Illustration of the online dynamics learning problem. A drone navigates a cluttered environment while satisfying safety constraints, despite its dynamics being unknown a priori. These constraints require avoiding collisions with the orange obstacles and passing through the green gate. The green curve illustrates the optimal trajectory that the drone would have taken if the dynamics were known exactly. The dotted curve shows the actual trajectory executed by the drone, which deviates from the optimal path due to model uncertainty. The gray region represents the propagated uncertainty during planning at the current location of the drone. Initially, uncertainty is large, leading the drone to plan conservatively. As the drone learns the dynamics online, the propagated uncertainty shrinks, resulting in less conservative plans, and the executed trajectory approaches the optimal one.
  • Figure 2: Illustration of the policy set in (a) state space and (b) policy space. In (a), the cyan region denotes the (invariant) safe set $\mathbb{X}_{n}$ and the green region represents the state constraint $\mathcal{X}$. The shaded region shows the reachable set under a pessimistic policy, which starts in the safe set and returns to it while satisfying the constraints. The green curve shows an informative trajectory ensuring the sampling condition (eq:sampling_rule_timex). The orange curve shows a trajectory under another optimistic policy that satisfies the constraints with an $\epsilon$-margin and is prepended with a short horizon $\delta h$ to move from $x(k) \to x'$ via policy $\hat{\pi}$. Panel (b) shows the maximum exploration objective (obj:maximum_exploration), where due to exploration the pessimistic policy set, starting from $\Pi_{0}^{\mathrm{p}}$, expands to $\Pi_{\bar{n}}^{\mathrm{p}}$ and covers the connected true policy set $\Pi_{c}^{\star,\epsilon}$. Note that $\Pi_{c}^{\star,\epsilon}$ is a subset of $\Pi^{\star,\epsilon}$, which is, in general, disconnected and thus cannot be discovered by executing only safe policies.
  • Figure 3: Visual illustration of the tolerances $\epsilon_d$ and $\epsilon_c$ used to define the dynamics exploration scheme. Trajectories generated under the sampled dynamics $\bm{f}^s$ and the true dynamics $\bm{f}^{\star}$ are shown, with the deviation $d_h$ measuring their discrepancy after horizon $h$. The arrows perpendicular to the trajectory manifold indicate the uncertainty, which exceeds $\epsilon_d$ and $\epsilon_c$ at the location of interest.
  • Figure 4: Illustration of the SageDynX algorithm during (a) exploration and (b) convergence after satisfying the termination criterion. The cyan region denotes the safe set $\mathbb{X}_{n}$, and the green region represents the state constraints $\mathcal{X}$. The blue dashed line shows the optimal trajectory achieved by the clairvoyant agent starting anywhere in the safe set $\mathbb{X}_{n}$ and satisfying the constraints with an $\epsilon$ margin. In (a), the shaded region shows the reachable set under the pessimistic policy optimized via (eq:slack_there_exists_goal), which starts at the current state $x(k)$, ensures that all the dynamics satisfy the constraints, and returns to the safe set. The black line shows the agent's executed trajectory, with small dots marking collected data and large dots indicating the times of model updates. With every model update (increasing $n$), the reachable set predicted with a given policy $\pi$ (represented by increasingly darker shades) shrinks, since the model uncertainty reduces with data. The agent keeps replanning while ensuring returnability to a known safe set, but without having to actually return. Once the termination criterion is satisfied, the agent returns to the safe set. As shown in (b), it then executes the returned policy, which first navigates in the safe set for a short horizon $\delta h$ (wiggly line) and then executes the optimized policy $\pi^{\mathrm{p}}$, shown by the black line, which closely matches the optimal trajectory (blue dashed line).
  • Figure 5: Cumulative regret over time (averaged across runs) in different environments. SageDynX achieves an order of magnitude lower regret compared to the baselines in both experiments.
  • ...and 6 more figures

Theorems & Definitions (27)

  • Lemma 1: Well-calibrated model (abbasi2013online)
  • Remark 1
  • Remark 2
  • Theorem 1: Maximum safe dynamics exploration
  • Remark 3: Implementation
  • Remark 4: Task-oriented exploration with $J^{\mathrm{any}}(x_s, n; \pi)$
  • Theorem 2: Safe reward maximization with unknown dynamics
  • Corollary 1: Same horizon
  • Theorem 3: Sample complexity lower bound
  • Proposition 1
  • ...and 17 more