Table of Contents
Fetching ...

Expressive Temporal Specifications for Reward Monitoring

Omar Adalat, Francesco Belardinelli

TL;DR

This work tackles sparse, non-informative rewards in reinforcement learning by leveraging quantitative Linear Temporal Logic on finite traces to synthesize reward monitors that emit dense, real-valued feedback during runtime. By constructing a synchronous product of the environment MDP with these monitors, the approach enables non-Markovian, temporally extended goals while preserving the sufficiency of Markovian policies. The authors provide a memoized, linear-time monitor synthesis procedure, demonstrate safety-aware reward clamping, and empirically show that quantitative monitors often outperform Boolean monitors and can match or exceed handcrafted rewards across a range of environments. The method promises improved sample efficiency and reliable convergence for long-horizon tasks, with broad applicability due to its algorithm-agnostic design and domain-agnostic formulation of monitors. Future work points toward richer temporal operators, discounted rewards, robustness, and multi-agent extensions.

Abstract

Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (($\text{LTL}_f[\mathcal{F}]$)) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

Expressive Temporal Specifications for Reward Monitoring

TL;DR

This work tackles sparse, non-informative rewards in reinforcement learning by leveraging quantitative Linear Temporal Logic on finite traces to synthesize reward monitors that emit dense, real-valued feedback during runtime. By constructing a synchronous product of the environment MDP with these monitors, the approach enables non-Markovian, temporally extended goals while preserving the sufficiency of Markovian policies. The authors provide a memoized, linear-time monitor synthesis procedure, demonstrate safety-aware reward clamping, and empirically show that quantitative monitors often outperform Boolean monitors and can match or exceed handcrafted rewards across a range of environments. The method promises improved sample efficiency and reliable convergence for long-horizon tasks, with broad applicability due to its algorithm-agnostic design and domain-agnostic formulation of monitors. Future work points toward richer temporal operators, discounted rewards, robustness, and multi-agent extensions.

Abstract

Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces (()) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

Paper Structure

This paper contains 36 sections, 5 theorems, 21 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

The state and transition overhead of quantitative reward monitor construction from a ltl$_f$$[\mathcal{F}]$ formula $\varphi$ is linear with respect to the size of $\varphi$.

Figures (2)

  • Figure 1: A boolean monitor with ltl$_f$ semantics (left-hand side) and quantitative monitor with ltl$_f$$[\mathcal{F}]$ semantics (right-hand side) constructed for the same formula: $\lnot a \: \mathcal{U}\xspace \: (a \land F b)$.
  • Figure 2: Task-completion percentages across safety-gridworlds and toy environments, $\pm$ 1 std. Also detailed in Table \ref{['table:cw']}.

Theorems & Definitions (22)

  • Definition 1: ltl$_f$$[\mathcal{F}]$ syntax
  • Definition 2: Safe- ltl$_f$
  • Definition 3: Semantics of ltl$_f$$[\mathcal{F}]$
  • Definition 4: BRM
  • Definition 5: Specification-reward pairs
  • Definition 6: QRM
  • Theorem 1
  • Lemma 2
  • Theorem 3: Correctness
  • Definition 7: Extended MDP
  • ...and 12 more