Maximally Permissive Reward Machines

Giovanni Varricchione, Natasha Alechina, Mehdi Dastani, Brian Logan

TL;DR

The paper addresses reward specification for temporally extended tasks by constructing maximally permissive reward machines (MPRMs) from the set of all partial-order plans for a task. MPRMs track prefixes of plan linearisations and reward the completion of a plan, so that, under goal-adequate planning domains, the learned policies are at least as good as those learned with reward machines based on a single plan. The authors prove guarantees comparing the RM variants and validate them empirically in CraftWorld, where MPRMs typically obtain higher rewards, though they converge more slowly because of the added flexibility. The approach gives a principled way to encode planning-based guidance in reinforcement learning for complex tasks in discrete environments.
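The idea of tracking prefixes of plan linearisations can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the paper's construction: the class `PrefixRewardMachine`, the helper `linearisations`, the event names, and the reward of 1.0 on completion are all hypothetical, and each plan is assumed to contain distinct actions.

```python
from itertools import permutations


def linearisations(actions, before):
    """Yield every total order of `actions` that respects the (a, b) pairs
    in `before`, each meaning "a must occur before b"."""
    for order in permutations(actions):
        pos = {a: i for i, a in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in before):
            yield order


class PrefixRewardMachine:
    """Toy machine whose states are prefixes of linearisations of any plan
    in a given set of partial-order plans (hypothetical simplification)."""

    def __init__(self, plans):
        # Collect the linearisations of every plan, not just one of them.
        self.complete = set()
        for actions, before in plans:
            self.complete.update(linearisations(actions, before))
        # Every prefix of every linearisation is a reachable machine state.
        self.prefixes = {
            lin[:k] for lin in self.complete for k in range(len(lin) + 1)
        }
        self.state = ()

    def step(self, event):
        """Advance on an observed high-level event and return the reward."""
        if self.state in self.complete:
            return 0.0  # a plan was already completed: no further reward
        nxt = self.state + (event,)
        if nxt not in self.prefixes:
            return 0.0  # event extends no plan prefix: stay in place
        self.state = nxt
        return 1.0 if nxt in self.complete else 0.0


# Hypothetical plan: collect wood and iron in either order, then use a factory.
pop = (
    ("get_wood", "get_iron", "use_factory"),
    {("get_wood", "use_factory"), ("get_iron", "use_factory")},
)
rm = PrefixRewardMachine([pop])
rewards = [
    rm.step(e) for e in ["get_iron", "use_factory", "get_wood", "use_factory"]
]
```

In this sketch the agent may collect the two resources in either order, whereas a machine built from a single sequential plan would accept only one of the two orders. Enumerating permutations is exponential in the plan length, so this only suits very small plans.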

Abstract

Reward machines allow the definition of rewards for temporally extended tasks and behaviors. Specifying "informative" reward machines can be challenging. One way to address this is to generate reward machines from a high-level abstract description of the learning environment, using techniques such as AI planning. However, previous planning-based approaches generate a reward machine based on a single (sequential or partial-order) plan, and do not allow maximum flexibility to the learning agent. In this paper we propose a new approach to synthesising reward machines which is based on the set of partial order plans for a goal. We prove that learning using such "maximally permissive" reward machines results in higher rewards than learning using RMs based on a single plan. We present experimental results which support our theoretical claims by showing that our approach obtains higher rewards than the single-plan approach in practice.

Paper Structure

This paper contains 11 sections, 2 theorems, 8 equations, 5 figures, and 1 algorithm.

Key Result

Theorem 2.1

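Let $\mathcal{M}$ be a labelled MDP, $\mathcal{D}$ a planning domain over $\mathcal{M}$, and $\text{RM-}\overline{\Pi}$, $\text{RM-}\overline{\pi}$ and $\text{RM-}\pi$ final state reward machines generated from $\mathcal{D}$ for the same task. Then $\rho^*_{\overline{\Pi}} \geq \rho^*_{\overline{\pi}} \geq \rho^*_{\pi}$, where $\rho^*_{\overline{\Pi}}$, $\rho^*_{\overline{\pi}}$ and $\rho^*_{\pi}$ are the optimal policies learned using $\text{RM-}\overline{\Pi}$, $\text{RM-}\overline{\pi}$ and $\text{RM-}\pi$ respectively, and $\rho_1 \geq \rho_2$ if and only if $v(\rho_1(s)) \geq v(\rho_2(s))$ for all states $s \in S$ of $\mathcal{M}$.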
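The ordering $\rho_1 \geq \rho_2$ used in the theorem is a state-wise comparison of values, which the following Python sketch makes concrete. The function name `at_least_as_good`, the state names, and all numbers are hypothetical and chosen only for illustration; the per-state values are assumed to have been computed beforehand.

```python
def at_least_as_good(v1, v2):
    """Return True iff policy 1 is at least as good as policy 2 in every state.

    `v1` and `v2` map each state to the value obtained by following the
    respective policy from that state (assumed to be precomputed).
    """
    if v1.keys() != v2.keys():
        raise ValueError("both policies must be evaluated on the same states")
    return all(v1[s] >= v2[s] for s in v1)


# Hypothetical per-state values, not taken from the paper.
v_a = {"s0": 0.9, "s1": 0.7}
v_b = {"s0": 0.9, "s1": 0.5}
v_c = {"s0": 1.0, "s1": 0.1}

a_vs_b = at_least_as_good(v_a, v_b)
b_vs_a = at_least_as_good(v_b, v_a)
a_vs_c = at_least_as_good(v_a, v_c)
c_vs_a = at_least_as_good(v_c, v_a)
```

Because the comparison must hold in every state, it is only a partial order: `v_a` and `v_c` above are incomparable, since each is better in a different state.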

Figures (5)

  • Figure 1: MPRM for the bridge task. Positive and negative postconditions are respectively denoted with a superscript $+$ and $-$.
  • Figure 2: Results for the bridge task.
  • Figure 3: Results for the gold task.
  • Figure 4: Results for the gold-or-gem task.
  • Figure 5: Illustration of the behaviour of the MPRM and POP-trained agents on the gold-or-gem task.

Theorems & Definitions (9)

  • Definition 1.1: Labelled MDP
  • Definition 1.2: Reward Machine
  • Definition 1.3: MDPRM
  • Definition 1.4: Set of all partial-order plans
  • Theorem 2.1
  • proof
  • Definition 2.2
  • Theorem 2.3
  • proof