Learning Reward Machines from Partially Observed Policies
Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
TL;DR
The paper presents a SAT-based framework for learning Reward Machines (RMs) from partially observed policies in labeled MDPs, where neither the rewards nor the RM states are observed. It introduces a prefix tree policy to capture the observable behavior, then encodes the RM structure, determinism, and negative examples into a SAT instance. When the prefix tree is known to a sufficiently large depth, the method yields a minimal RM that is policy-equivalent to the true one. The method extends to learning from demonstrations via robust MAX-SAT, which mitigates mislabeled negative examples, and it is evaluated on grid worlds, a continuous robotic arm, and real mouse navigation data. The approach offers a principled, model-free path to recovering structured, temporally extended task representations with theoretical guarantees and practical performance benefits.
Abstract
Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or from demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy, which associates a distribution over actions with each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
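To make the two central objects concrete, the following is a minimal Python sketch and not the authors' implementation. It models a reward machine as a finite-state machine driven by labels, and it estimates a prefix tree policy from demonstrations as an empirical action distribution for each pair of label prefix and MDP state. All names (`RewardMachine`, `build_prefix_tree_policy`, `conflicting_prefixes`, `label_of`, `max_depth`) are hypothetical. Two conventions are assumptions rather than statements from the abstract: the prefix includes the label of the current state, and a "negative example" is read as a pair of prefixes that reach the same MDP state with different action distributions, so that they presumably must map to different reward machine states. The SAT encoding itself is not shown.

```python
from collections import Counter, defaultdict


class RewardMachine:
    """Finite-state machine whose transitions fire on atomic-proposition labels."""

    def __init__(self, initial, delta, reward):
        self.initial = initial
        self.delta = delta    # (rm_state, label) -> next rm_state
        self.reward = reward  # (rm_state, label) -> float

    def run(self, labels):
        """Return the final RM state and the accumulated reward."""
        u, total = self.initial, 0.0
        for label in labels:
            total += self.reward.get((u, label), 0.0)
            u = self.delta.get((u, label), u)  # assumption: unspecified pairs self-loop
        return u, total


def build_prefix_tree_policy(demos, label_of, max_depth):
    """Empirical action distribution for each (label prefix, MDP state) pair.

    demos: list of trajectories, each a list of (state, action) pairs.
    label_of: hypothetical labeling function from MDP state to a label.
    """
    counts = defaultdict(Counter)
    for trajectory in demos:
        prefix = ()
        for state, action in trajectory:
            prefix = prefix + (label_of(state),)  # assumption: current label included
            if len(prefix) > max_depth:
                break
            counts[(prefix, state)][action] += 1
    return {
        key: {a: n / sum(c.values()) for a, n in c.items()}
        for key, c in counts.items()
    }


def conflicting_prefixes(tree, tol=1e-9):
    """Pairs of prefixes reaching the same MDP state with different policies."""
    by_state = defaultdict(list)
    for (prefix, state), dist in tree.items():
        by_state[state].append((prefix, dist))
    pairs = []
    for state, entries in by_state.items():
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                (p, dp), (q, dq) = entries[i], entries[j]
                actions = set(dp) | set(dq)
                if any(abs(dp.get(a, 0.0) - dq.get(a, 0.0)) > tol for a in actions):
                    pairs.append((state, p, q))
    return pairs


# Toy usage: three MDP states with labels "", "A", "B".
labels = {0: "", 1: "A", 2: "B"}
demos = [
    [(0, "right"), (1, "right"), (2, "left")],
    [(0, "right"), (1, "up")],
    [(2, "left"), (1, "left")],
]
tree = build_prefix_tree_policy(demos, labels.get, max_depth=3)
conflicts = conflicting_prefixes(tree)

rm = RewardMachine(initial=0, delta={(0, "A"): 1, (1, "B"): 0}, reward={(1, "B"): 1.0})
final_state, total_reward = rm.run(["A", "", "B"])
```

In this toy data, MDP state 1 is reached by two different label prefixes with different empirical action distributions, so `conflicts` contains one pair. With finite demonstrations, such differences can also arise from sampling noise, which is the kind of mislabeling that the robust MAX-SAT extension mentioned in the TL;DR is meant to tolerate. The exact-comparison tolerance used here is only a placeholder.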
