Maximally Permissive Reward Machines

Giovanni Varricchione; Natasha Alechina; Mehdi Dastani; Brian Logan

TL;DR

The paper addresses the difficulty of specifying informative reward machines for temporally extended tasks by constructing maximally permissive reward machines (MPRMs) from the set of all partial-order plans for a task. MPRMs track prefixes of plan linearisations and provide rewards that encourage completing a plan; in goal-adequate planning domains, the resulting policies are at least as good as those learned with reward machines based on a single plan. The authors prove guarantees comparing the RM variants and validate them empirically in CraftWorld, where MPRMs typically yield higher rewards, albeit with slower convergence due to the increased flexibility. The approach offers a principled way to encode planning-based guidance into reinforcement learning for complex tasks in discrete environments.
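The idea of tracking prefixes of plan linearisations can be illustrated with a minimal Python sketch. This is not the authors' construction: it assumes that plan actions are directly observable (the paper works with labels and postconditions), that the machine pays a reward of 1 only when a plan is completed, and that the action names and helper functions (`linearisations`, `build_prefix_machine`, `step`) are hypothetical.

```python
from itertools import permutations


def linearisations(actions, before):
    """Return every total order of `actions` that respects `before`.

    `before` is a set of pairs (a, b) meaning a must precede b.
    Brute force; only suitable for very small plans.
    """
    result = []
    for perm in permutations(actions):
        pos = {a: i for i, a in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in before):
            result.append(perm)
    return result


def build_prefix_machine(plans):
    """Build a machine whose states are prefixes of plan linearisations.

    `plans` is a list of (actions, before) partial-order plans.
    Final states are the complete linearisations.
    """
    transitions = {}  # (prefix, action) -> longer prefix
    finals = set()
    for actions, before in plans:
        for lin in linearisations(actions, before):
            for i, action in enumerate(lin):
                transitions[(lin[:i], action)] = lin[:i + 1]
            finals.add(lin)
    return transitions, finals


def step(transitions, finals, state, action):
    """Advance if `action` extends the current prefix, else stay put.

    Assumed reward scheme: 1.0 on first reaching a final state, else 0.0.
    """
    nxt = transitions.get((state, action), state)
    reward = 1.0 if nxt in finals and state not in finals else 0.0
    return nxt, reward


# Hypothetical partial-order plan: wood and iron in either order, then factory.
plans = [(
    ("get_wood", "get_iron", "use_factory"),
    {("get_wood", "use_factory"), ("get_iron", "use_factory")},
)]
transitions, finals = build_prefix_machine(plans)

state = ()
for action in ("get_iron", "get_wood", "use_factory"):
    state, reward = step(transitions, finals, state, action)
```

Because both orderings of the first two actions are prefixes of some linearisation, the agent is rewarded whichever it chooses, which is the flexibility a machine built from a single sequential plan lacks.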

Abstract

Reward machines allow the definition of rewards for temporally extended tasks and behaviors. Specifying "informative" reward machines can be challenging. One way to address this is to generate reward machines from a high-level abstract description of the learning environment, using techniques such as AI planning. However, previous planning-based approaches generate a reward machine based on a single (sequential or partial-order) plan, and do not allow maximum flexibility to the learning agent. In this paper we propose a new approach to synthesising reward machines which is based on the set of partial order plans for a goal. We prove that learning using such "maximally permissive" reward machines results in higher rewards than learning using RMs based on a single plan. We present experimental results which support our theoretical claims by showing that our approach obtains higher rewards than the single-plan approach in practice.
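For readers unfamiliar with reward machines, the following Python sketch shows one as a finite-state machine over labels, built here from a single sequential plan. It is illustrative only: the `RewardMachine` class, the state names `u0` to `u3`, the labels, and the reward values are assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """A finite-state machine that maps observed labels to rewards."""

    initial: str
    finals: set
    delta: dict  # (state, label) -> (next_state, reward)
    state: str = field(init=False)

    def __post_init__(self):
        self.state = self.initial

    def step(self, label):
        """Consume one label; unknown (state, label) pairs self-loop with reward 0."""
        nxt, reward = self.delta.get((self.state, label), (self.state, 0.0))
        self.state = nxt
        return reward


# Hypothetical single sequential plan: wood, then iron, then factory.
rm = RewardMachine(
    initial="u0",
    finals={"u3"},
    delta={
        ("u0", "wood"): ("u1", 0.0),
        ("u1", "iron"): ("u2", 0.0),
        ("u2", "factory"): ("u3", 1.0),
    },
)

rewards = [rm.step(label) for label in ("iron", "wood", "iron", "factory")]
```

In this sketch the first `iron` label is ignored because the single plan fixes wood as the first step, so an agent that collects iron first gets no credit for it. This is the kind of restriction the abstract attributes to single-plan reward machines.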

Paper Structure (11 sections, 2 theorems, 8 equations, 5 figures, 1 algorithm)

Preliminaries
Reinforcement Learning
Labelled MDPs
Reward Machines
Symbolic Planning
Maximally Permissive Reward Machines
Empirical Evaluation
Experimental Setup
Results
Related Work
Conclusions

Key Result

Theorem 2.1

Let $\mathcal{M}$ be a labelled MDP, $\mathcal{D}$ a planning domain over $\mathcal{M}$, and $\text{RM-}\overline{\Pi}$, $\text{RM-}\overline{\pi}$ and $\text{RM-}\pi$ final state reward machines generated from $\mathcal{D}$ for the same task. Then, where $\rho_1 \geq \rho_2$ if and only if $v(\rho_1(s)) \geq v(\rho_2(s))$ for all states $s \in S$ of $\mathcal{M}$.
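The ordering $\rho_1 \geq \rho_2$ used in the theorem is a pointwise comparison over all states, so two policies can be incomparable. A small Python sketch makes this concrete, assuming each policy is summarised by a hypothetical table mapping states to values; the `dominates` helper and the numbers are illustrative only.

```python
def dominates(values_1, values_2):
    """True if policy 1 is at least as good as policy 2 in every state."""
    return all(values_1[s] >= values_2[s] for s in values_1)


# Hypothetical per-state values for three policies over states s0 and s1.
v_a = {"s0": 1.0, "s1": 0.8}
v_b = {"s0": 0.9, "s1": 0.8}
v_c = {"s0": 0.5, "s1": 0.9}

a_over_b = dominates(v_a, v_b)  # a is at least as good as b in both states
a_over_c = dominates(v_a, v_c)  # fails in s1
c_over_a = dominates(v_c, v_a)  # fails in s0
```

Here the first policy dominates the second, while the first and third are incomparable because each is better in a different state.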

Figures (5)

Figure 1: MPRM for the bridge task. Positive and negative postconditions are respectively denoted with a superscript $+$ and $-$.
Figure 2: Results for the bridge task.
Figure 3: Results for the gold task.
Figure 4: Results for the gold-or-gem task.
Figure 5: Illustration of the behaviour of the MPRM and POP-trained agents on the gold-or-gem task.

Theorems & Definitions (9)

Definition 1.1: Labelled MDP
Definition 1.2: Reward Machine
Definition 1.3: MDPRM
Definition 1.4: Set of all partial-order plans
Theorem 2.1
proof
Definition 2.2
Theorem 2.3
proof
