DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL

Mathias Jackermeier; Alessandro Abate

DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL

Mathias Jackermeier, Alessandro Abate

TL;DR

DeepLTL addresses the challenge of zero-shot satisfaction of arbitrary LTL specifications in multi-task RL by exploiting the structure of limit-deterministic Büchi automata to reason about satisfaction paths. It represents LTL formulae as reach-avoid sequences and learns a sequence-conditioned policy using a DeepSets+RNN architecture that conditions actions on possible satisfaction paths. At test time, it selects the optimal reach-avoid sequence from the current automaton state to guide planning and action, enabling handling of infinite-horizon specifications and safety constraints. Empirical results across discrete and continuous domains show superior satisfaction probability and efficiency compared with state-of-the-art baselines, highlighting the practical impact of separating high-level temporal reasoning from low-level control for general LTL-conditioned RL.

Abstract

Linear temporal logic (LTL) has recently been adopted as a powerful formalism for specifying complex, temporally extended tasks in multi-task reinforcement learning (RL). However, learning policies that efficiently satisfy arbitrary specifications not observed during training remains a challenging problem. Existing approaches suffer from several shortcomings: they are often only applicable to finite-horizon fragments of LTL, are restricted to suboptimal solutions, and do not adequately handle safety constraints. In this work, we propose a novel learning approach to address these concerns. Our method leverages the structure of Büchi automata, which explicitly represent the semantics of LTL specifications, to learn policies conditioned on sequences of truth assignments that lead to satisfying the desired formulae. Experiments in a variety of discrete and continuous domains demonstrate that our approach is able to zero-shot satisfy a wide range of finite- and infinite-horizon specifications, and outperforms existing methods in terms of both satisfaction probability and efficiency. Code available at: https://deep-ltl.github.io/

DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL

TL;DR

Abstract

Paper Structure (67 sections, 3 theorems, 25 equations, 12 figures, 8 tables, 1 algorithm)

This paper contains 67 sections, 3 theorems, 25 equations, 12 figures, 8 tables, 1 algorithm.

Introduction
Background
Reinforcement learning.
Linear temporal logic.
Büchi automata.
Problem setting
Method
Representing LTL specifications as sequences
Computing accepting cycles.
From paths to sequences.
Overview of the approach
Model architecture
Representing sets of assignments.
Representing reach-avoid sequences.
Training procedure
...and 52 more sections

Key Result

Theorem 1

For any $\gamma\in (0, 1)$ we have where $O_\pi = \mathop{\mathrm{\mathbb E}}\limits_{\varphi\sim\xi, \tau\sim\pi|\varphi}\bigl[ |\{q\in\tau_q : q\in\mathcal{F}_{\mathcal{B}_\varphi}\}|\,|\, \tau\not\models\varphi \bigr]$ is the expected number of visits to accepting states for trajectories that do not satisfy a specification.

Figures (12)

Figure 1: Limitations of existing methods, illustrated via trajectories in the ZoneEnv environment. The initial agent position is denoted as an orange diamond. (a) Most existing approaches cannot handle infinite-horizon tasks, such as $\mathsf{G}\,\mathsf{F}\,\mathsf{blue}\land\mathsf{G}\,\mathsf{F}\,\mathsf{green}$. (b) Given the formula $\mathsf{F}\, (\mathsf{blue}\land \mathsf{F}\, \mathsf{green})$, a myopic approach produces a suboptimal solution (orange line). We prefer the more efficient green trajectory. (c) Given the task $(\mathsf{F}\,\mathsf{green}\lor\mathsf{F}\,\mathsf{yellow})\land\mathsf{G}\,\neg\mathsf{blue}$, the agent should aim to reach the yellow region instead of the green region, since this is the safer option. Many existing approaches are unable to perform this sort of planning.
Figure 2: Overview of our approach. (Left) During training, we train a general sequence-conditioned policy with reach-avoid sequences $\sigma$. (Right) At test time, we construct an LDBA from the target specification $\varphi$. We then select the optimal reach-avoid sequence $\sigma^*$ for the current LDBA state $q$ according to the value function $V^\pi(s,\sigma)$, and produce an action $a$ from the policy conditioned on $\sigma^*$.
Figure 3: Illustration of the sequence module. The positive and negative assignments in the truncated reach-avoid sequence $\sigma$ are encoded using the DeepSets architecture, which produces embeddings $\bm e_A$. These are then concatenated and passed through an RNN, which yields the final sequence representation $\bm e_\sigma$.
Figure 4: Evaluation curves on reach/avoid specifications. Each datapoint is collected by averaging the discounted return of the policy across 50 episodes with randomly sampled tasks, and shaded areas indicate 90% confidence intervals over 5 different random seeds.
Figure 5: Results on infinite-horizon tasks. (a), (b) Example trajectories for infinite-horizon specifications. (c) Performance on various recurrence tasks. We report the average number of visits to accepting states over 500 episodes (i.e. completed cycles), with standard deviations over 5 seeds.
...and 7 more figures

Theorems & Definitions (10)

Example 1
Definition 1: Product MDP
Theorem 1: Corollary of voloshin2023Eventual, Theorem 4.2
proof
Remark
Example 2
Lemma 1
proof
Theorem 1
proof

DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL

TL;DR

Abstract

DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (12)

Theorems & Definitions (10)