Bellman Value Decomposition for Task Logic in Safe Optimal Control

William Sharpless, Oswin So, Dylan Hirsch, Sylvia Herbert, Chuchu Fan

TL;DR

This work proves the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE.

Abstract

Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve for the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.
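For readers unfamiliar with the two well-known equations named in the abstract, the following is a minimal tabular sketch of the Reach-Avoid BE and the Avoid BE in their commonly used undiscounted form. It is not the paper's VDPPO and makes no claim about the paper's exact notation. The sign convention (a nonnegative Value means the specification is satisfied) follows Lemma 1. The names `r` (target margin), `q` (safety margin), `step`, and the 1D chain world are assumptions introduced here purely for illustration. The Reach-Avoid-Loop BE is not sketched because its form is not stated in this summary.

```python
# Illustrative sketch only: undiscounted tabular value iteration for the
# standard Reach-Avoid and Avoid Bellman equations on a hypothetical 1D chain.
# r[x] >= 0 means x is in the target; q[x] >= 0 means x is safe (assumed names).

N = 6                    # number of cells in the chain (assumption)
ACTIONS = (-1, 0, 1)     # move left, stay, move right


def step(x, a):
    """Deterministic dynamics: move by a, clipped to the chain."""
    return min(max(x + a, 0), N - 1)


def reach_avoid_values(r, q, step, actions, iters=100):
    """Iterate V(x) = min(q(x), max(r(x), max_a V(step(x, a))))."""
    V = [min(q[x], r[x]) for x in range(len(r))]   # horizon-0 Value
    for _ in range(iters):
        V_new = [
            min(q[x], max(r[x], max(V[step(x, a)] for a in actions)))
            for x in range(len(r))
        ]
        if V_new == V:   # fixed point reached
            break
        V = V_new
    return V


def avoid_values(q, step, actions, iters=100):
    """Iterate V(x) = min(q(x), max_a V(step(x, a)))."""
    V = list(q)                                    # horizon-0 Value
    for _ in range(iters):
        V_new = [
            min(q[x], max(V[step(x, a)] for a in actions))
            for x in range(len(q))
        ]
        if V_new == V:
            break
        V = V_new
    return V


# Target at cell 5, obstacle at cell 2.
r = [-1, -1, -1, -1, -1, 1]
q = [1, 1, -1, 1, 1, 1]

V_ra = reach_avoid_values(r, q, step, ACTIONS)
V_a = avoid_values(q, step, ACTIONS)
print(V_ra)  # [-1, -1, -1, 1, 1, 1]
print(V_a)   # [1, 1, -1, 1, 1, 1]
```

In this toy world, cells 0 and 1 receive a negative Reach-Avoid Value because every path to the target crosses the obstacle, while their Avoid Value is positive because staying put is always safe. Starting the iteration from the horizon-0 Value matters here, since the undiscounted equations can admit more than one fixed point.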

Paper Structure

This paper contains 46 sections, 41 theorems, 147 equations, 9 figures.

Key Result

Lemma 1

Let $\mathsf{v}_\mathsf{p}$ be the predicate for $V[\mathsf{p}]$, i.e., $(\xi_x, t) \models \mathsf{v}_\mathsf{p} \iff V[\mathsf{p}](\xi_{x}(t)) \ge 0$. The following properties hold:

Figures (9)

  • Figure 1: Value-Decomposition and VDPPO. The Bellman Value for a range of temporal logic (e.g., multi-goal, recurrence, stability, safety) decomposes into a Value graph connected by atomic Bellman equations (Thms. 1–4). We propose VDPPO, an algorithm that exploits this structure to learn policies for complex, high-dimensional tasks. Our approach is validated on hardware with Herding and Delivery, two complex tasks involving a heterogeneous team of drones and a quadruped.
  • Figure 2: E.g. $N$-Until-Conjunction Value Decomposition. Here we illustrate the primary decomposition result (Thm. \ref{thm:n-ra} extension, Appendix) with a GridWorld example (left) for a given specification. The corresponding DVG is shown (center left), with each node representing a decomposed Value and edges representing dependencies. In the center right, a subset of decomposed Values solved with dynamic programming is shown, along with the discounted solution produced by VDPPO. On the right, the optimal path for a given initial condition is shown.
  • Figure 3: E.g. $\mathsf{G}$($N$-Until-Conjunction) Value Decomposition. We illustrate the recursive decomposition result (Thm. \ref{thm:n-ra-loop}) with a GridWorld example (left) for a given specification. The plots here are analogous to those of Fig. \ref{fig:nRAdemofig}, with the DVG (center left), decomposed Values (center right), and optimal path (right). Note that the optimal path for the discounted case differs due to the subtle effect of discounting the Value associated with a $\mathsf{G}$ composition, which selects for shorter loops (Sec. \ref{sec:alwayscomp}).
  • Figure 4: Graphical Depiction of Algorithms.
  • Figure 5: Performance scaling with TL complexity. Value decomposition enables VDPPO to better scale by tackling smaller problems.
  • ...and 4 more figures

Theorems & Definitions (60)

  • Definition 1
  • Definition 2
  • Definition 3
  • Remark 1
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • Theorem 4
  • ...and 50 more