Maximally Permissive Reward Machines
Giovanni Varricchione, Natasha Alechina, Mehdi Dastani, Brian Logan
TL;DR
The paper tackles learning rewards for temporally extended tasks by constructing maximally permissive reward machines (MPRMs) from the entire set of partial-order plans for a task. MPRMs track prefixes of plan linearisations and provide rewards that encourage completing a plan, yielding policies at least as good as those learned with reward machines based on a single plan, under goal-adequate planning domains. The authors prove theoretical guarantees comparing RM variants and validate them empirically in CraftWorld, where MPRMs typically yield higher rewards albeit with slower convergence due to increased flexibility. This approach offers a principled way to encode broad planning-based guidance into reinforcement learning, improving policy quality for complex tasks in discrete environments and suggesting promising directions for scalable planning-based RL.
Abstract
Reward machines allow the definition of rewards for temporally extended tasks and behaviors. Specifying "informative" reward machines can be challenging. One way to address this is to generate reward machines from a high-level abstract description of the learning environment, using techniques such as AI planning. However, previous planning-based approaches generate a reward machine based on a single (sequential or partial-order) plan, and do not allow maximum flexibility to the learning agent. In this paper, we propose a new approach to synthesising reward machines which is based on the set of partial-order plans for a goal. We prove that learning using such "maximally permissive" reward machines results in higher rewards than learning using RMs based on a single plan. We present experimental results which support our theoretical claims by showing that our approach obtains higher rewards than the single-plan approach in practice.
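To make the idea of a reward machine built from partial-order plans more concrete, the following Python sketch may help. It is not the authors' construction. It assumes a simplified setting in which each environment event is a plan-action label, a partial-order plan is a pair (actions, ordering constraints), and each machine state is a single prefix of some linearisation. The names `linearisations`, `build_rm`, `step`, and the CraftWorld-style labels `get_wood`, `get_iron`, and `use_factory` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def linearisations(actions, order):
    """Yield every total order of `actions` that respects the (before, after) pairs in `order`."""
    for perm in permutations(actions):
        pos = {a: i for i, a in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in order):
            yield perm


def build_rm(plans):
    """Build a prefix-tracking machine from a list of (actions, order) plans.

    Assumption: states are tuples, each one a prefix of some linearisation.
    Returns (transitions, finals), where transitions maps (state, event) -> state
    and finals is the set of complete linearisations.
    """
    prefixes, finals = set(), set()
    for actions, order in plans:
        for lin in linearisations(actions, order):
            finals.add(lin)
            for k in range(len(lin) + 1):
                prefixes.add(lin[:k])
    # Each non-empty prefix is reached from its parent prefix by its last action.
    transitions = {(p[:-1], p[-1]): p for p in prefixes if p}
    return transitions, finals


def step(state, event, transitions, finals):
    """Advance the machine; events that extend no prefix leave the state unchanged."""
    nxt = transitions.get((state, event), state)
    # Reward 1 only on the transition that completes some plan (an illustrative choice).
    reward = 1 if nxt in finals and state not in finals else 0
    return nxt, reward


# Hypothetical task: wood and iron may be collected in either order before using the factory.
plan = (
    ["get_wood", "get_iron", "use_factory"],
    [("get_wood", "use_factory"), ("get_iron", "use_factory")],
)
transitions, finals = build_rm([plan])


def run(events):
    """Feed a sequence of events to the machine and return the final state and total reward."""
    state, total = (), 0
    for e in events:
        state, r = step(state, e, transitions, finals)
        total += r
    return state, total
```

In this sketch, a machine built from a single sequential plan would accept only one of the two orderings of `get_wood` and `get_iron`, whereas the prefix-tracking machine accepts both. Enumerating all permutations is exponential in plan length, so this is only suitable for very small illustrative plans.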
