Reward Machines for Deep RL in Noisy and Uncertain Environments

Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith

TL;DR

This work extends Reward Machines (RMs) to deep RL when the interpretation of the domain vocabulary is uncertain, framing the problem as a POMDP and introducing the Noisy Reward Machine Environment. It proposes three RM-state inference modules (Naive, Independent Belief Updating, and Temporal Dependency Modelling, or TDM) that exploit RM structure through noisy abstractions, and shows that only TDM is consistently reliable in partially observable settings. Theoretical results establish equivalence to a POMDP and identify when the choice of abstraction matters, while experiments across Traffic Light, Kitchen, Colour Matching, and Gold Mining show that TDM achieves Oracle-like performance and improves sample efficiency. The findings suggest that task structure, when paired with temporally aware belief modelling and even zero-shot abstractions from foundation models, can robustly guide learning in real-world, uncertain environments.

Abstract

Reward Machines provide an automaton-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing the underlying structure of a reward function, they enable the decomposition of an RL task, leading to impressive gains in sample efficiency. Although Reward Machines and similar formal specifications have a rich history of application towards sequential decision-making problems, they critically rely on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function. Such ground-truth interpretations are elusive in the real world, due in part to partial observability and noisy sensing. In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary.

Paper Structure

This paper contains 27 sections, 11 theorems, 6 equations, 9 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.1

A Noisy RM Environment $\langle \mathcal{E}, \mathcal{R}, \mathcal{L}, \mathcal{M} \rangle$ is equivalent to a POMDP over state space $S \times U$ and observation space $O$.
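The state space $S \times U$ in the theorem pairs an environment state with an RM state. The sketch below is one way to picture that product process, not the paper's implementation. It assumes a deterministic RM, a labelling function of the form `label_fn(s, a, s_next)`, and a toy three-cell corridor; the names `RewardMachine`, `product_step`, `env_step`, and `label_fn` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

Props = FrozenSet[str]


@dataclass
class RewardMachine:
    initial: int
    # (RM state, set of true propositions) -> (next RM state, reward)
    transitions: Dict[Tuple[int, Props], Tuple[int, float]]

    def step(self, u: int, props: Props) -> Tuple[int, float]:
        # Assumption of this sketch: unlisted pairs self-loop with zero reward.
        return self.transitions.get((u, props), (u, 0.0))


def product_step(rm: RewardMachine, label_fn: Callable, env_step: Callable,
                 s: int, u: int, action: int):
    """One transition of the product process over S x U."""
    s_next, obs = env_step(s, action)
    props = label_fn(s, action, s_next)   # ground truth, hidden from the agent
    u_next, reward = rm.step(u, props)
    return (s_next, u_next), obs, reward


# Toy corridor with cells 0..2; the observation hides the exact position.
def env_step(s: int, a: int):
    s_next = max(0, min(2, s + a))
    return s_next, ("near" if s_next == 2 else "far")


def label_fn(s: int, a: int, s_next: int) -> Props:
    return frozenset({"goal"}) if s_next == 2 else frozenset()


# Reward 1 the first time "goal" holds, nothing afterwards.
rm = RewardMachine(initial=0,
                   transitions={(0, frozenset({"goal"})): (1, 1.0)})

s, u = 0, rm.initial
trace = []
for _ in range(3):
    (s, u), obs, r = product_step(rm, label_fn, env_step, s, u, +1)
    trace.append((s, u, obs, r))
```

In this toy run the last two steps end in the same cell with the same observation but earn different rewards. The reward is therefore non-Markovian in the environment state alone and becomes Markovian once the RM state `u` is tracked alongside it. The agent sees only `obs`, which is why the pair `(s, u)` plays the role of a hidden POMDP state.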

Figures (9)

  • Figure 1: The Noisy Reward Machine Environment framework. Blue elements highlight differences with respect to a standard RL framework. Dashed lines indicate that an element is required during training but not deployment.
  • Figure 1: On-policy RL that decouples RM state inference using an abstraction model $\mathcal{M}$ and decision making.
  • Figure 2: The Gold Mining Problem is a Noisy RM Environment where the agent's interpretation of the vocabulary is uncertain. Left: The four rightmost cells yield gold, while two cells in the second column yield iron pyrite, which has no value. The agent cannot reliably distinguish between the two metals; each cell is labelled with the probability the agent assigns to it yielding gold. Right: The RM emits a (non-Markovian) reward of 1 for collecting gold and delivering it to the depot.
  • Figure 3: Traffic Light (top left) and Kitchen (bottom left) are MiniGrid environments with image observations, where key propositions are partially observable. Colour Matching (right) is a MuJoCo robotics environment where the agent must identify colour names by their RGB values to solve a task.
  • Figure 4: RL curves averaged over 8 runs (shaded regions show standard error). TDM performs well in all domains, in the absence of the ground-truth labelling function, while Recurrent PPO fails.
  • ...and 4 more figures
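
The Gold Mining setting in Figure 2 hints at why temporal dependencies matter when inferring the RM state. The sketch below is illustrative only and is not the paper's Independent Belief Updating or TDM module. It assumes that a cell's content is fixed, so repeated digs at the same cell are perfectly correlated, and it uses a hypothetical cell `(1, 2)` with a hypothetical gold probability of 0.3. The function names are likewise invented for this example.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def independent_update(belief_has_gold: float, p_gold: float) -> float:
    """Treat every dig as a fresh, independent draw (the pitfall)."""
    return belief_has_gold + (1.0 - belief_has_gold) * p_gold


def cell_aware_belief(p_by_cell: Dict[Cell, float], dug: List[Cell]) -> float:
    """Count each distinct cell once, since its content never changes."""
    p_no_gold = 1.0
    for cell in set(dug):
        p_no_gold *= 1.0 - p_by_cell[cell]
    return 1.0 - p_no_gold


p_by_cell = {(1, 2): 0.3}      # hypothetical belief for one cell
digs = [(1, 2)] * 10           # dig the same cell ten times

naive = 0.0
for cell in digs:
    naive = independent_update(naive, p_by_cell[cell])

aware = cell_aware_belief(p_by_cell, digs)
print(f"independent: {naive:.3f}  history-aware: {aware:.3f}")
```

Under the independence assumption the belief of holding gold climbs towards 1 simply by digging the same uncertain cell again and again, whereas an estimate that accounts for the history stays at the cell's original probability. An agent rewarded for acting on the first belief could learn to dig repeatedly at a doubtful cell and then head to the depot with no gold.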

Theorems & Definitions (22)

  • Theorem 4.1
  • Theorem 4.2: Does the choice of $\mathcal{M}$ affect optimal behaviour?
  • Theorem 4.3: Does observing $\mathcal{L}$ affect optimal behaviour?
  • Example 5.1
  • Example 5.2
  • Definition 5.3: Consistency
  • Theorem A.1
  • proof
  • Theorem A.1: Does the choice of $\mathcal{M}$ affect optimal behaviour?
  • proof
  • ...and 12 more