Table of Contents
Fetching ...

Zero-Shot Instruction Following in RL via Structured LTL Representations

Mattia Giuri, Mathias Jackermeier, Alessandro Abate

TL;DR

This work advances zero-shot instruction following in reinforcement learning by representing complex LTL tasks as sequences of Boolean formulas derived from Büchi automata. The policy is conditioned on these structured formulas via a graph neural network that processes formula graphs and a recurrent module that captures progress along an accepting run, enabling robust generalization when multiple propositions interact. The approach outperforms state-of-the-art baselines on a chess-inspired environment, particularly on infinite-horizon tasks, and demonstrates resilience to increasing task difficulty. The method offers a principled and scalable way to encode structured temporal specifications for generalist RL policies with potential for broader applicability beyond the tested domain.

Abstract

Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.

Zero-Shot Instruction Following in RL via Structured LTL Representations

TL;DR

This work advances zero-shot instruction following in reinforcement learning by representing complex LTL tasks as sequences of Boolean formulas derived from Büchi automata. The policy is conditioned on these structured formulas via a graph neural network that processes formula graphs and a recurrent module that captures progress along an accepting run, enabling robust generalization when multiple propositions interact. The approach outperforms state-of-the-art baselines on a chess-inspired environment, particularly on infinite-horizon tasks, and demonstrates resilience to increasing task difficulty. The method offers a principled and scalable way to encode structured temporal specifications for generalist RL policies with potential for broader applicability beyond the tested domain.

Abstract

Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents. Recent work has shown that interpreting LTL instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time. However, existing approaches fall short in environments where multiple high-level events (i.e., atomic propositions) can be true at the same time and potentially interact in complicated ways. In this work, we propose a novel approach to learning a multi-task policy for following arbitrary LTL instructions that addresses this shortcoming. Our method conditions the policy on sequences of simple Boolean formulae, which directly align with transitions in the automaton, and are encoded via a graph neural network (GNN) to yield structured task representations. Experiments in a complex chess-based environment demonstrate the advantages of our approach.

Paper Structure

This paper contains 43 sections, 1 theorem, 20 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.1

Given an LTL formula $\varphi$, it is possible to automatically construct an LDBA $\mathcal{B}_\varphi$ such that $\mathcal{B}_\varphi$ accepts exactly the traces $\sigma$ for which $\sigma \models\varphi$.

Figures (6)

  • Figure 1: LDBA for the formula $(\mathsf{F}\,\mathsf{G}\, \mathsf a) \lor \mathsf{F}\, \mathsf (\mathsf b\land\mathsf{F}\, \mathsf c)$.
  • Figure 2: Given an LTL specification $\varphi$, we first construct an LDBA $\mathcal{B}_\varphi$ to represent the task structure and memory. We then extract a sequence of Boolean formulae from an accepting run in $\mathcal{B}_\varphi$ to condition our policy. If the LDBA state changes due to progress or external events, we recompute the accepting run from the new state $q'$ and re-condition the policy.
  • Figure 3: A sequence of Boolean formulae is processed via a GNN to produce a sequence of embeddings, which are passed through an RNN to obtain a representation of an accepting run. The MDP state is encoded via a state encoder. Both embeddings are fed into the policy network, which produces an action.
  • Figure 4: In the ChessWorld environment, the agent (the white king) must navigate the chessboard while avoiding certain squares attacked by black pieces. The agent can move along the 8 compass directions. Blue shading indicates squares attacked by at least one piece.
  • Figure 5: Average discounted returns $J(\pi)$ over the number of interaction steps during training. Shaded areas indicate 90% confidence intervals over 5 random seeds.
  • ...and 1 more figures

Theorems & Definitions (3)

  • Theorem 3.1: sickert2016LimitDeterministic
  • Example 3.1
  • Example 4.1