Learning Robust Reward Machines from Noisy Labels
Roko Parac, Lorenzo Nodari, Leo Ardon, Daniel Furelos-Blanco, Federico Cerutti, Alessandra Russo
TL;DR
PROB-IRM learns robust reward machines (RMs) from noisy traces and uses them to guide RL. It interleaves RM learning via ILASP with policy optimization, and handles label noise by maintaining a belief over RM states and applying probabilistic reward shaping. The approach converts noisy traces into weighted context-dependent partial interpretations (WCDPIs), learns noise-tolerant RMs from them, and exploits the learned RM through the belief over its states to shape rewards and accelerate learning. Empirical results on OfficeWorld show that PROB-IRM can learn usable RMs from noisy data and achieve performance comparable to hand-crafted RMs, which suggests the method is robust and applicable in environments with imperfect sensing.
Abstract
This paper presents PROB-IRM, an approach that learns robust reward machines (RMs) for reinforcement learning (RL) agents from noisy execution traces. The key aspect of RM-driven RL is the exploitation of a finite-state machine that decomposes the agent's task into different subtasks. PROB-IRM uses a state-of-the-art inductive logic programming framework robust to noisy examples to learn RMs from noisy traces using Bayesian posterior degrees of belief, thus ensuring robustness against inconsistencies. Pivotal for the results is the interleaving between RM learning and policy learning: a new RM is learned whenever the RL agent generates a trace that is believed not to be accepted by the current RM. To speed up the training of the RL agent, PROB-IRM employs a probabilistic formulation of reward shaping that uses the posterior Bayesian beliefs derived from the traces. Our experimental analysis shows that PROB-IRM can learn (potentially imperfect) RMs from noisy traces and exploit them to train an RL agent to solve its tasks successfully. Despite the complexity of learning the RM from noisy traces, agents trained with PROB-IRM perform comparably to agents provided with handcrafted RMs.
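To make the belief-based mechanism more concrete, the following is a minimal Python sketch of how a belief over RM states might be propagated through noisy labels, used for potential-based reward shaping, and used to decide when to relearn the RM. It is an illustration under stated assumptions, not the formulation used by PROB-IRM: the two-step "coffee then office" task, the per-state potentials, the acceptance threshold, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical two-step task: observe "coffee", then "office".
# (state, proposition) -> next state; with any other label the RM stays put.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("u0", "coffee"): "u1",
    ("u1", "office"): "u_acc",
}
STATES: List[str] = ["u0", "u1", "u_acc"]
# Assumed per-state potentials (closer to acceptance means higher potential).
POTENTIAL: Dict[str, float] = {"u0": 0.0, "u1": 0.5, "u_acc": 1.0}


def update_belief(belief: Dict[str, float],
                  label_probs: Dict[str, float]) -> Dict[str, float]:
    """Propagate a belief over RM states through one noisy label.

    label_probs[p] is the posterior probability that proposition p held.
    Assumes at most one outgoing proposition per state, as in TRANSITIONS.
    """
    new_belief = {u: 0.0 for u in STATES}
    for u, mass in belief.items():
        stay = 1.0
        for (src, prop), dst in TRANSITIONS.items():
            if src == u:
                p = label_probs.get(prop, 0.0)
                new_belief[dst] += mass * p  # mass that takes the transition
                stay -= p
        new_belief[u] += mass * stay  # mass that remains in the same state
    return new_belief


def expected_potential(belief: Dict[str, float]) -> float:
    """Potential of a belief: the expectation of the per-state potentials."""
    return sum(belief[u] * POTENTIAL[u] for u in STATES)


def shaped_reward(reward: float, belief: Dict[str, float],
                  next_belief: Dict[str, float], gamma: float = 0.99) -> float:
    """Potential-based shaping computed on beliefs instead of exact RM states."""
    return reward + gamma * expected_potential(next_belief) - expected_potential(belief)


def needs_relearning(goal_reached: bool, belief: Dict[str, float],
                     threshold: float = 0.5) -> bool:
    """Flag a trace whose outcome disagrees with the believed RM acceptance."""
    believed_accepted = belief["u_acc"] >= threshold
    return goal_reached != believed_accepted


b0 = {"u0": 1.0, "u1": 0.0, "u_acc": 0.0}
b1 = update_belief(b0, {"coffee": 0.8})   # sensor is 80% sure coffee was seen
b2 = update_belief(b1, {"office": 0.5})   # sensor is 50% sure office was seen
```

In this sketch the shaping term rewards the agent in proportion to how much belief mass moves toward the accepting state, so an uncertain label yields a smaller bonus than a confident one. The `needs_relearning` check mirrors the interleaving idea in the abstract: a trace that reaches the goal while the current RM is believed not to accept it is treated as a counterexample that would trigger a new round of RM learning.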
