Table of Contents
Fetching ...

Semantically Labelled Automata for Multi-Task Reinforcement Learning with LTL Instructions

Alessandro Abate, Giuseppe De Giacomo, Mathias Jackermeier, Jan Kretínský, Maximilian Prokop, Christoph Weinhuber

TL;DR

This work presents a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis, which achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.

Abstract

We study multi-task reinforcement learning (RL), a setting in which an agent learns a single, universal policy capable of generalising to arbitrary, possibly unseen tasks. We consider tasks specified as linear temporal logic (LTL) formulae, which are commonly used in formal methods to specify properties of systems, and have recently been successfully adopted in RL. In this setting, we present a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis. The resulting semantically labelled automata contain rich, structured information in each state that allow us to (i) compute the automaton efficiently on-the-fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) naturally support full LTL. Experimental results in a variety of domains demonstrate that our approach achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.

Semantically Labelled Automata for Multi-Task Reinforcement Learning with LTL Instructions

TL;DR

This work presents a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis, which achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.

Abstract

We study multi-task reinforcement learning (RL), a setting in which an agent learns a single, universal policy capable of generalising to arbitrary, possibly unseen tasks. We consider tasks specified as linear temporal logic (LTL) formulae, which are commonly used in formal methods to specify properties of systems, and have recently been successfully adopted in RL. In this setting, we present a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis. The resulting semantically labelled automata contain rich, structured information in each state that allow us to (i) compute the automaton efficiently on-the-fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) naturally support full LTL. Experimental results in a variety of domains demonstrate that our approach achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.
Paper Structure (57 sections, 1 theorem, 14 equations, 4 figures, 4 tables)

This paper contains 57 sections, 1 theorem, 14 equations, 4 figures, 4 tables.

Key Result

Theorem 1

Let $\mathcal{Q}$ be the set of states in the LDBA, $\textsf{AP}$ be the set of atomic propositions, and Paths be the set of all simple paths (i.e. paths without loops) starting from the current state $q\in\mathcal{Q}$. Let $C(n)$ denote the complexity of evaluating the value function on a sequence where $|\mathsf{Paths}|$ is upper-bounded by $e\cdot|\mathcal{Q}|!$.

Figures (4)

  • Figure 1: Schematic overview of our approach for the ZoneEnv environment (a robot navigating a 2D plane with zones of different colours) and the LTL task $\varphi=\operatorname{\mathbf{F}} \textbf{r}\textit{ed} \land \operatorname{\mathbf{FG}} \textbf{y}\textit{ellow}$. We first convert the task to an LDBA with semantic labelling (see Section \ref{['sec:semanticLabelling']}). Note that initially only $q_0$ is constructed. Then the semantic labelling of the current state $q_0$ is embedded and passed to the policy (see Section \ref{['sec:embedding']}). Together with the current state of the MDP $s_t$, the policy yields action $a$ in the MDP. The robot's trajectory in the MDP (green line) enters a zone with MDP label $\textbf{r}\textit{ed}$ which is used to update the automaton. Only then, the corresponding successor $q_1$ is constructed in the automaton.
  • Figure 2: Conveyor environment example. The task is either $\operatorname{\mathbf{F}} (\text{parcel}\land\operatorname{\mathbf{F}}\text{hammer})$ or $\operatorname{\mathbf{F}} (\text{parcel}\land\operatorname{\mathbf{F}}\text{wrench})$. Once the agent is on a conveyor belt, it cannot go back. Decomposition-based approaches can achieve a maximum success rate of 50%, whereas our approach finds the optimal solution.
  • Figure 3: Environment visualisations.
  • Figure 4: Sample trajectories produced by SemLTL for the ZoneEnv environment.

Theorems & Definitions (2)

  • Theorem 1
  • proof