Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

Pierriccardo Olivieri; Fausto Lasca; Alessandro Gianola; Matteo Papini

Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

Pierriccardo Olivieri, Fausto Lasca, Alessandro Gianola, Matteo Papini

TL;DR

This work introduces Linear Temporal Logic Modulo Theories over finite traces ($\text{LTLfMT}$) to specify non-Markovian rewards in large MDPs by treating atomic predicates as first-order formulas over theories. It identifies a tractable, lookahead-free fragment and provides a practical translation to a DFA and product MDP, enabling RL in continuous domains with theories such as Non-Linear Real Arithmetic (NRA). To combat reward sparsity, the authors combine Counterfactual Experiences for Reward Machines (CRM) with Hindsight Experience Replay (HER) within the product MDP, demonstrating improved learning in parking and reacher tasks. Empirical results show CRM-HER achieves robust performance across tasks of varying complexity, highlighting the approach's potential for expressive, low-encoding reward specifications and efficient learning. The work paves the way for integrating symbolic, first-order specifications with continuous RL and suggests future extensions to hierarchical RL and formal verification.

Abstract

In this work, we propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), a more expressive extension of classical temporal logic in which predicates are first-order formulas of arbitrary first-order theories rather than simple Boolean variables. This enhanced expressiveness enables the specification of complex tasks over unstructured and heterogeneous data domains, promoting a unified and reusable framework that eliminates the need for manual predicate encoding. However, the increased expressive power of LTLfMT introduces additional theoretical and computational challenges compared to standard LTLf specifications. We address these challenges from a theoretical standpoint, identifying a fragment of LTLfMT that is tractable but sufficiently expressive for reward specification in an infinite-state-space context. From a practical perspective, we introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity. We evaluate this approach to a continuous-control setting using Non-Linear Arithmetic Theory, showing that it enables natural specification of complex tasks. Experimental results show how a tailored implementation of HER is fundamental in solving tasks with complex goals.

Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

TL;DR

This work introduces Linear Temporal Logic Modulo Theories over finite traces (

) to specify non-Markovian rewards in large MDPs by treating atomic predicates as first-order formulas over theories. It identifies a tractable, lookahead-free fragment and provides a practical translation to a DFA and product MDP, enabling RL in continuous domains with theories such as Non-Linear Real Arithmetic (NRA). To combat reward sparsity, the authors combine Counterfactual Experiences for Reward Machines (CRM) with Hindsight Experience Replay (HER) within the product MDP, demonstrating improved learning in parking and reacher tasks. Empirical results show CRM-HER achieves robust performance across tasks of varying complexity, highlighting the approach's potential for expressive, low-encoding reward specifications and efficient learning. The work paves the way for integrating symbolic, first-order specifications with continuous RL and suggests future extensions to hierarchical RL and formal verification.

Abstract

Paper Structure (35 sections, 10 equations, 3 figures, 2 tables)

This paper contains 35 sections, 10 equations, 3 figures, 2 tables.

Introduction
Contributions and paper structure.
Related Works
Propositional languages equipped with labeling functions.
Existing remedies to sparse rewards.
Preliminaries
Reinforcement learning.
Linear temporal logic over finite traces.
Reward Specification via LTLfMT
Linear Temporal Logic Modulo Theories with finite traces
Fragment for Reward Specification in RL
Translation to DFA and Product MDP
Reinforcement Learning with LTLfMT Rewards
Continuous Control Case Study
Addressing Sparsity in the Product MDP
...and 20 more sections

Figures (3)

Figure 1: Success rate of each baselines for parking environment in 4 different tasks, with 95% bootstrap confidence interval over 20 independent experiments. The x-axis reports the training steps of DDPG. Task names are specified above each plot and are ordered left-to-right by task complexity.
Figure 2: Regret and cumulative reward plot for each task on Parking environment.
Figure 3: Success rate, cumulative regret and cumulative reward on each task for the reacher environment.

Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

TL;DR

Abstract

Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

Authors

TL;DR

Abstract

Table of Contents

Figures (3)