Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout

Navid Hashemi; Bardh Hoxha; Danil Prokhorov; Georgios Fainekos; Jyotirmoy Deshmukh

Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout

Navid Hashemi, Bardh Hoxha, Danil Prokhorov, Georgios Fainekos, Jyotirmoy Deshmukh

TL;DR

The paper tackles training neural network-based feedback controllers to enforce DT-STL specifications in nonlinear, long-horizon tasks by maximizing a DT-STL robustness objective. It introduces a gradient sampling technique, inspired by dropout and stochastic depth, plus critical-predicate-aware time sampling and safe re-smoothing to stabilize backpropagation across hundreds to thousands of steps. The approach is demonstrated on high-dimensional systems (12D quad-rotor, 20D multi-agent Dubins cars, 6D quad-rotor with moving platform), showing improved scalability and robustness to noise versus open-loop methods. A probabilistic verification framework based on conformal prediction provides high-confidence guarantees for satisfaction on unseen initial conditions. Overall, the method enables scalable synthesis of NN feedback controllers that satisfy complex temporal logic objectives in long-horizon, high-dimensional settings.

Abstract

This paper introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly nonlinear (albeit deterministic) environment. We desire the trained policy to ensure that the agent satisfies specific task objectives and safety constraints, both expressed in Discrete-Time Signal Temporal Logic (DT-STL). One advantage for reformulation of a task via formal frameworks, like DT-STL, is that it permits quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the {\em robustness}, which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback control, and we assume a feed forward neural network for learning the feedback controller. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent's task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and naïve gradient descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To tackle this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling. One of the main contributions is the notion of {\em controller network dropout}, where we approximate the NN controller in several time-steps in the task horizon by the control input obtained using the controller in a previous training step. We show that our control synthesis methodology, can be quite helpful for stochastic gradient descent to converge with less numerical issues, enabling scalable backpropagation over long time horizons and trajectories over high dimensional state spaces.

Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout

TL;DR

Abstract

Paper Structure (18 sections, 1 theorem, 27 equations, 12 figures, 4 tables, 1 algorithm)

This paper contains 18 sections, 1 theorem, 27 equations, 12 figures, 4 tables, 1 algorithm.

Introduction
Organization and Notations
Preliminaries
Training Neural Network Control Policies
Extension to Long Horizon Temporal Tasks & Higher Dimensional Systems
Sampling-Based Gradient Approximation Technique
Including the Critical Predicate in Time Sampling
Safe Re-Smoothing
Computing the Sampled Gradient
Experimental Evaluation
12-dimensional Quad-rotor (Nested 3-Future Formula)
Multi-Agent: Network of Dubins Cars (Nested Formula)
6-dimensional Quad-rotor & Moving Platform: Landing a Quad-rotor
Influence of waypoint function, critical predicate and time sampling on Algorithm \ref{['algo:training']}
Dubins Car: Growing Task Horizon for Dubins Car (Ablation study on time sampling)
...and 3 more sections

Key Result

Lemma 7

Consider $m$ independent and identically distributed (i.i.d.), real-valued data points drawn from some distribution $\mathcal{D}$. After they are drawn, suppose we sort them in ascending order and denote the $i^{th}$ smallest number by $R_i$,(i.e., we have $R_1 < R_2 < \ldots < R_m$). Let $\mathbf{B

Figures (12)

Figure 1: Shows an illustration of the recurrent structure for the control feedback system.
Figure 2: This figure shows the symbolic trajectory generated by NN feedback controller, and the computation graph for DT-STL robustness. The DT-STL robustness is presented as a Nero-symbolic computation graph hashemi2023neurosymbolic via $\mathsf{ReLU}$ and Linear activation functions.
Figure 3: this figure shows a common challenge in using critical predicate for control synthesis. This figure presents the robustness as a piece-wise differentiable function of control parameter $\mathbf{\theta}$ (with resolution, $0.00001$), where each differentiable segment represent a distinct critical predicate.
Figure 4: This figure shows an example for the relation between control parameters and the resulting robustness as a piece-wise differentiable function. Assuming a fixed initial state, every control parameter is corresponding to a simulated trajectory, and that trajectory represents a robustness value. This robustness value is equal to the quantitative semantics for the critical predicate. Within each differentiable segment in this plot, the control parameters yield trajectories associated with a unique critical predicate.
Figure 5: This figure depicts the sampling-based gradient computation. In our approach, we freeze the controller at some time-points, while at others we assume the controller to be a function of its parameters that can vary in this iteration of back-propagation process. The actions that are fixed are highlighted in red, whereas the dependent actions are denoted in black. The red circles represent the time-steps where the controller is frozen.
...and 7 more figures

Theorems & Definitions (12)

Example 1
Remark 1
Definition 2: Sub-trajectory & Sampled trajectory
Remark 3
Example 2
Definition 4: Critical Predicate
Example 3
Example 4
Remark 5
Example 5
...and 2 more

Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout

TL;DR

Abstract

Scaling Learning based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (12)

Theorems & Definitions (12)