Learning Robust Reward Machines from Noisy Labels

Roko Parac, Lorenzo Nodari, Leo Ardon, Daniel Furelos-Blanco, Federico Cerutti, Alessandra Russo

TL;DR

Prob-IRM learns robust reward machines (RMs) from noisy traces and uses them to guide reinforcement learning (RL). It interleaves RM learning via ILASP with policy optimization, maintaining a belief over RM states and applying probabilistic reward shaping to handle label noise. The approach generates weighted context-dependent partial interpretations (WCDPIs) from noisy traces, learns noise-tolerant RMs, and exploits them through the belief over RM states to shape rewards and accelerate learning. Empirical results on OfficeWorld show that Prob-IRM can learn usable RMs from noisy data and achieve performance comparable to that of agents given handcrafted RMs, which indicates robustness when sensing is imperfect.

Abstract

This paper presents PROB-IRM, an approach that learns robust reward machines (RMs) for reinforcement learning (RL) agents from noisy execution traces. The key aspect of RM-driven RL is the exploitation of a finite-state machine that decomposes the agent's task into different subtasks. PROB-IRM uses a state-of-the-art inductive logic programming framework robust to noisy examples to learn RMs from noisy traces using Bayesian posterior degrees of belief, thus ensuring robustness against inconsistencies. Pivotal for the results is the interleaving between RM learning and policy learning: a new RM is learned whenever the RL agent generates a trace that is believed not to be accepted by the current RM. To speed up the training of the RL agent, PROB-IRM employs a probabilistic formulation of reward shaping that uses the posterior Bayesian beliefs derived from the traces. Our experimental analysis shows that PROB-IRM can learn (potentially imperfect) RMs from noisy traces and exploit them to train an RL agent to solve its tasks successfully. Despite the complexity of learning the RM from noisy traces, agents trained with PROB-IRM perform comparably to agents provided with handcrafted RMs.
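
To make the idea of a belief over RM states and probabilistic reward shaping more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical three-state RM for a Coffee-like task (u0, u1, u_acc), one guarded transition per state, a per-proposition posterior probability supplied by the noisy labelling function, and hand-picked state potentials. The names TRANSITIONS, POTENTIAL, update_belief, and shaping_reward are illustrative only, and the shaping term uses the standard potential-based form applied to the expected potential under the belief.

```python
from typing import Dict, List, Tuple

# Hypothetical RM (assumption): u0 --coffee--> u1 --office--> u_acc,
# with an implicit self-loop when the guard proposition does not hold.
TRANSITIONS: Dict[str, List[Tuple[str, str]]] = {
    "u0": [("coffee", "u1")],
    "u1": [("office", "u_acc")],
    "u_acc": [],
}

# Assumed potentials (not from the paper): higher means closer to acceptance.
POTENTIAL: Dict[str, float] = {"u0": 0.0, "u1": 0.5, "u_acc": 1.0}


def update_belief(belief: Dict[str, float],
                  label_probs: Dict[str, float]) -> Dict[str, float]:
    """Propagate the belief one step given posterior label probabilities."""
    new_belief = {u: 0.0 for u in TRANSITIONS}
    for u, mass in belief.items():
        stay = mass
        for prop, nxt in TRANSITIONS[u]:
            # Mass that follows the transition guarded by `prop`.
            moved = mass * label_probs.get(prop, 0.0)
            new_belief[nxt] += moved
            stay -= moved
        # Remaining mass stays in the current RM state.
        new_belief[u] += stay
    return new_belief


def expected_potential(belief: Dict[str, float]) -> float:
    """Potential averaged over the belief on RM states."""
    return sum(p * POTENTIAL[u] for u, p in belief.items())


def shaping_reward(belief: Dict[str, float],
                   next_belief: Dict[str, float],
                   gamma: float = 0.99) -> float:
    """Potential-based shaping term computed on beliefs."""
    return gamma * expected_potential(next_belief) - expected_potential(belief)


# The agent starts in u0 and the sensor reports coffee with posterior 0.8.
b0 = {"u0": 1.0, "u1": 0.0, "u_acc": 0.0}
b1 = update_belief(b0, {"coffee": 0.8})
bonus = shaping_reward(b0, b1, gamma=1.0)
```

In this sketch, a noisy coffee detection moves most but not all of the belief mass to u1, so the shaping bonus is scaled by how strongly the agent believes it has made progress rather than being granted or withheld outright.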

Paper Structure

This paper contains 26 sections, 12 equations, 5 figures, and 1 algorithm.

Figures (5)

  • Figure 1: An OfficeWorld instance (left) and a reward machine for the Coffee task (right), where o.w. stands for otherwise.
  • Figure 2: Learning curves for the three OfficeWorld tasks, where the upper row corresponds to the baseline agents provided with handcrafted RMs, and the lower row corresponds to the agents trained with Prob-IRM.
  • Figure 3: Learning curves for Prob-IRM agents trained to solve the Coffee task in the noise-all scenario.
  • Figure 4: Learning curves for Prob-IRM agents trained to solve the Coffee task in the noise-all scenario without reward shaping.
  • Figure 5: Learning curves for Prob-IRM agents employing thresholding to solve the Coffee task under a noise-all setting with a posterior of 0.8.

Theorems & Definitions (9)

  • Example 1
  • Example 2
  • Example 3
  • Definition 1
  • Definition 2: Noisy labelling function
  • Definition 3: Noisy trace
  • Example 4: Generation of WCDPI from a noisy trace
  • Definition 4: RM state belief $\tilde{u}_t$
  • Example 5: Reward shaping in the OfficeWorld's Coffee task