Zero-Shot Instruction Following in RL via Structured LTL Representations

Mathias Jackermeier, Mattia Giuri, Jacques Cloete, Alessandro Abate

TL;DR

This work tackles zero-shot instruction following in multi-task reinforcement learning, with tasks specified formally in linear temporal logic (LTL). The authors introduce StructLTL, which represents an LTL instruction as sequences of Boolean formulae derived from the transitions of a limit-deterministic Büchi automaton (LDBA). A hierarchical encoder for formulae in disjunctive normal form (DNF) and a temporal attention mechanism turn these sequences into a structured task representation that lets the policy reason about future subgoals. Trained with PPO and curriculum learning, StructLTL generalises to unseen LTL formulae and outperforms state-of-the-art baselines on ZoneEnv and Warehouse tasks, particularly on complex specifications. The method advances compositional generalisation for temporally extended RL tasks and offers a scalable approach to safe instruction-following agents in high-dimensional environments. Suggested future directions include learning labelling functions for real-world settings and safety-focused extensions.

Abstract

We study instruction following in multi-task reinforcement learning, where an agent must zero-shot execute novel tasks not seen during training. In this setting, linear temporal logic (LTL) has recently been adopted as a powerful framework for specifying structured, temporally extended tasks. While existing approaches successfully train generalist policies, they often struggle to effectively capture the rich logical and temporal structure inherent in LTL specifications. In this work, we address these concerns with a novel approach to learn structured task representations that facilitate training and generalisation. Our method conditions the policy on sequences of Boolean formulae constructed from a finite automaton of the task. We propose a hierarchical neural architecture to encode the logical structure of these formulae, and introduce an attention mechanism that enables the policy to reason about future subgoals. Experiments in a variety of complex environments demonstrate the strong generalisation capabilities and superior performance of our approach.
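To make the idea of a hierarchical encoding with attention over future subgoals more concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the fixed embeddings in `PROP_EMBEDDINGS`, the sign flip for negation, the mean pooling, the `tanh` non-linearity, and the choice of query in `attend` are all assumptions made for illustration. The only point it shows is that pooling first over the literals of a conjunction and then over the clauses of a disjunction yields an encoding that does not depend on the order in which a formula is written.

```python
import math
from typing import Dict, List, Sequence, Tuple

Literal = Tuple[str, bool]   # (proposition name, True if not negated)
Clause = Sequence[Literal]   # conjunction of literals
DNF = Sequence[Clause]       # disjunction of clauses
Vector = List[float]

# Hypothetical fixed embeddings; a trained model would learn these.
PROP_EMBEDDINGS: Dict[str, Vector] = {
    "red": [1.0, 0.0, 0.0],
    "green": [0.0, 1.0, 0.0],
    "blue": [0.0, 0.0, 1.0],
}


def embed_literal(literal: Literal) -> Vector:
    """Look up a proposition vector; negation flips its sign (toy assumption)."""
    name, positive = literal
    vec = PROP_EMBEDDINGS[name]
    return list(vec) if positive else [-x for x in vec]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean, which ignores the order of its inputs."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def encode_dnf(formula: DNF) -> Vector:
    """Level 1 pools literals into clause vectors; level 2 pools clauses."""
    clause_vectors = [
        [math.tanh(x) for x in mean_pool([embed_literal(lit) for lit in clause])]
        for clause in formula
    ]
    return mean_pool(clause_vectors)


def attend(query: Vector, sequence: Sequence[Vector]) -> Tuple[Vector, List[float]]:
    """Dot-product softmax attention over a sequence of formula encodings."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in sequence]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    summary = [
        sum(w * vec[i] for w, vec in zip(weights, sequence))
        for i in range(len(query))
    ]
    return summary, weights


# A two-step sequence: first (red AND NOT blue) OR green, then blue.
step_1 = [[("red", True), ("blue", False)], [("green", True)]]
step_2 = [[("blue", True)]]
sequence = [encode_dnf(step_1), encode_dnf(step_2)]

# Using the current step's encoding as the query is an arbitrary choice here.
task_vector, weights = attend(sequence[0], sequence)
```

In a learned model the pooling steps would typically be replaced by trainable layers, and `task_vector` would play the role of the latent task representation that is passed to the policy alongside the encoded state.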

Paper Structure

This paper contains 40 sections, 9 equations, 8 figures, 6 tables, and 1 algorithm.

Figures (8)

  • Figure 1: LDBA for the LTL formula $\mathsf{F}\, \mathsf{red} \lor (\mathsf{F}\,\mathsf{G}\, \mathsf{green})$.
  • Figure 2: Overview of our method. (Left) Given an LTL instruction $\varphi$, we construct an LDBA and extract sequences of Boolean formulae representing the current automaton state. (Right) The policy is conditioned on a single formula sequence. It processes formulae with a hierarchical DNF encoder and a temporal attention mechanism to reason about future subgoals. The latent task representation $z$ is concatenated with the encoded MDP state $s_t$ and mapped to an action $a_t$ via an actor MLP.
  • Figure 3: Evaluation curves on finite-horizon tasks over training. We report averages over the entire set of finite-horizon evaluation tasks, computed over 50 episodes per task. Shaded areas indicate 90% confidence intervals over 10 random seeds.
  • Figure 4: Visualisation of environments. (a) In ZoneEnv the agent has to navigate between zones of different colours. (b) In the Warehouse environment, the agent has to move crates and vases to different regions. In both environments, the agent receives lidar observations and outputs continuous actions for acceleration and steering.
  • Figure 5: Ablation studies. We report averages over finite-horizon evaluation tasks, computed over 50 episodes per task. Shaded areas indicate 90% confidence intervals over 10 random seeds.
  • ...and 3 more figures
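
As a small illustration of what the formula in Figure 1 requires, the sketch below checks $\mathsf{F}\, \mathsf{red} \lor (\mathsf{F}\,\mathsf{G}\, \mathsf{green})$ on an ultimately periodic trace, that is, a finite prefix followed by a loop repeated forever. It evaluates the standard LTL semantics directly rather than constructing the LDBA, and the function name and trace encoding are assumptions made for this example. On such a trace, $\mathsf{F}\,\mathsf{G}\,\mathsf{green}$ holds exactly when every step of the loop satisfies green, because any loop step without green recurs infinitely often.

```python
from typing import FrozenSet, Sequence

Step = FrozenSet[str]  # the propositions that hold at one time step


def satisfies_example(prefix: Sequence[Step], loop: Sequence[Step]) -> bool:
    """Check F red OR (F G green) on the infinite trace prefix + loop repeated."""
    if not loop:
        raise ValueError("loop must be non-empty to describe an infinite trace")
    # F red: red holds at some position; prefix and one pass of the loop
    # cover every distinct position of the trace.
    eventually_red = any("red" in step for step in list(prefix) + list(loop))
    # F G green: green holds forever from some point on, which on this kind
    # of trace means green holds at every step of the loop.
    eventually_always_green = all("green" in step for step in loop)
    return eventually_red or eventually_always_green


empty: Step = frozenset()
green: Step = frozenset({"green"})
red: Step = frozenset({"red"})

stays_green = satisfies_example([empty], [green])        # True via F G green
leaves_green = satisfies_example([green], [empty])       # False
visits_red = satisfies_example([empty], [green, red])    # True via F red
```

The second disjunct is the reason a Büchi-style acceptance condition is needed: it constrains what happens infinitely often, so no finite prefix of the trace can confirm it.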

Theorems & Definitions (2)

  • Example 3.1
  • Example 4.1