Learning Reward Machines from Partially Observed Policies

Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay

TL;DR

The paper presents a SAT-based framework for learning Reward Machines (RMs) from partially observed policies in labeled MDPs, where neither the rewards nor the RM states are observed. It introduces a prefix tree policy to capture the observable behavior, then encodes the RM structure, determinism, and negative examples into a SAT instance. When the prefix tree policy is known to a sufficiently large depth, the solution is a minimal RM that is policy-equivalent to the true one. The method extends to learning from demonstrations via a robust MAX-SAT formulation that mitigates mislabeled negative examples, and it is demonstrated on grid worlds, a continuous robotic arm, and real mouse navigation data. The approach offers a principled, model-free path to recovering structured, temporally extended task representations with theoretical guarantees and practical performance benefits.

Abstract

Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
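To make the central object concrete, the following is a minimal sketch of a reward machine as a finite-state machine driven by label sequences. It is an illustration only, not the authors' implementation: the class name `RewardMachine`, the toy two-state patrol task, the labels `"A"`, `"B"`, `"H"`, and the choice to attach rewards to (RM state, label) pairs are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Toy reward machine (hypothetical structure, for illustration only)."""
    initial: str
    delta: dict   # (rm_state, label) -> next rm_state
    reward: dict  # (rm_state, label) -> reward; missing entries count as 0

    def run(self, labels):
        """Extended transition function: RM state reached from the
        initial state after reading the whole label sequence."""
        u = self.initial
        for label in labels:
            u = self.delta[(u, label)]
        return u

    def total_reward(self, labels):
        """Sum of rewards collected along the label sequence."""
        u, total = self.initial, 0.0
        for label in labels:
            total += self.reward.get((u, label), 0.0)
            u = self.delta[(u, label)]
        return total


# Assumed toy task: alternate between room A and room B; "H" is a hallway label.
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", "A"): "u1", ("u0", "B"): "u0", ("u0", "H"): "u0",
        ("u1", "B"): "u0", ("u1", "A"): "u1", ("u1", "H"): "u1",
    },
    reward={("u0", "A"): 1.0, ("u1", "B"): 1.0},
)

print(rm.run(["H", "A", "H"]))           # RM state after reading H, A, H
print(rm.total_reward(["A", "B", "A"]))  # reward accumulated along A, B, A
```

The same MDP state can call for different actions depending on the RM state, which is why the observable policy has to be indexed by the label history rather than by the MDP state alone.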

Paper Structure

This paper contains 38 sections, 4 theorems, 31 equations, 12 figures, 7 tables, 4 algorithms.

Key Result

Lemma 1

Let $\sigma, \sigma' \in (\mathrm{AP})^*$ be two finite label sequences. If $\pi_{\mathrm{PTP}}^{\mathrm{true}}(a|s, \sigma) \neq \pi_{\mathrm{PTP}}^{\mathrm{true}}(a|s, \sigma')$, then $\delta_{\mathbf{u}}^*(u_I,\sigma) \neq \delta_{\mathbf{u}}^*(u_I,\sigma')$.
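Lemma 1 suggests how negative examples could be extracted: whenever the prefix tree policy assigns different action distributions to the same MDP state under two label sequences, those sequences must lead to different RM states. The sketch below illustrates that idea under stated assumptions. The dictionary layout of `ptp`, the function name `negative_examples`, and the tolerance `tol` are hypothetical choices for this example, and the SAT encoding that would consume these pairs is not shown.

```python
from itertools import combinations


def negative_examples(ptp, tol=1e-9):
    """Collect pairs of label sequences that must reach different RM states.

    ptp maps (mdp_state, label_sequence_tuple) -> {action: probability}.
    This layout is an assumption made for the illustration.
    """
    # Group the observed label sequences by MDP state.
    by_state = {}
    for (s, sigma), dist in ptp.items():
        by_state.setdefault(s, []).append((sigma, dist))

    pairs = set()
    for entries in by_state.values():
        for (sig1, d1), (sig2, d2) in combinations(entries, 2):
            actions = set(d1) | set(d2)
            # Different action distributions at the same MDP state
            # imply different RM states (the statement of Lemma 1).
            if any(abs(d1.get(a, 0.0) - d2.get(a, 0.0)) > tol for a in actions):
                pairs.add(frozenset((sig1, sig2)))
    return pairs


# Made-up prefix tree policy entries for two MDP states.
ptp = {
    ("s0", ("A",)): {"up": 1.0},
    ("s0", ("A", "B")): {"down": 1.0},
    ("s0", ("B",)): {"up": 1.0},
    ("s1", ("A",)): {"up": 0.5, "down": 0.5},
}
pairs = negative_examples(ptp)
for pair in pairs:
    print(sorted(pair))
```

Note that the converse does not follow from the lemma: two sequences with identical distributions at a given state, such as `("A",)` and `("B",)` at `s0` above, may or may not reach the same RM state, so they are not reported.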

Figures (12)

  • Figure 1: (a) The room grid world. (b) The patrol reward machine. (c) The room grid world with a hallway.
  • Figure 2: (a): Block World MDP. The left-most stacking configuration has label $\mathbf{st_1}$, where all blocks are stacked on the first pile with green being under yellow and yellow being under red. Similarly, the middle configuration has label $\mathbf{st_2}$ and the right-most configuration has label $\mathbf{st_3}$. (b): Stacking Reward Machine.
  • Figure 3: (a) The block stacking configuration with label ${\color{blue}\mathbf{st_{bd}}}$ that we want our robot to avoid. (b) The ground truth reward machine. (c) Smaller consistent reward machine with the task stack-avoid.
  • Figure 4: Reacher experiment. (a): 2-link robotic arm with the three colored targets. (b) First recovered reward machine model. (c) Second recovered reward machine model.
  • Figure 5: Labyrinth experiment. (a): Maze structure and state space definition. (b): Trajectory of a single mouse. (c): Recovered reward machine. (a) and (b) are reprinted from rosenberg2021mice.
  • ...and 7 more figures

Theorems & Definitions (16)

  • Definition 1
  • Lemma 1
  • proof
  • Remark 1
  • Remark 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Theorem 1
  • ...and 6 more