Reinforcement Learning with Symbolic Reward Machines

Thomas Krug; Daniel Neider

TL;DR

Symbolic Reward Machines (SRMs), together with the learning algorithms QSRM and LSRM, are proposed to overcome the limitations of Reward Machines (RMs). In the evaluation, the SRM methods outperform baseline RL approaches and match the results of existing RM methods.

Abstract

Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) to represent and learn sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels that the environment emits alongside the observation. However, this concept requires manual input for each environment and task: the user has to write a suitable labeling function that computes the labels. This requirement leads to poor applicability in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs) together with the learning algorithms QSRM and LSRM to overcome these limitations. SRMs consume only the standard output of the environment and process the observation directly through guards that are represented by symbolic formulas. In our evaluation, our SRM methods outperform the baseline RL approaches and match the results of the existing RM methods. At the same time, our methods adhere to the widely used environment definition and provide interpretable representations of the task to the user.
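To make the idea of guards over raw observations concrete, the following is a minimal sketch and not the authors' implementation. It assumes an observation is an (x, y) position and models each guard as a plain Python predicate standing in for a symbolic formula. The class name `SymbolicRewardMachine`, the helper `run`, the regions `at_coffee` and `at_office`, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Observation = Tuple[float, float]      # assumed: (x, y) position of the agent
Guard = Callable[[Observation], bool]  # stand-in for a symbolic formula


@dataclass(frozen=True)
class Transition:
    guard: Guard    # formula evaluated directly on the observation
    target: int     # next machine state if the guard holds
    reward: float   # reward emitted when the transition fires


class SymbolicRewardMachine:
    """Hypothetical minimal SRM: a finite-state machine with guarded edges."""

    def __init__(self, initial_state: int,
                 transitions: Dict[int, List[Transition]],
                 terminal_states: Set[int]) -> None:
        self.initial_state = initial_state
        self.transitions = transitions
        self.terminal_states = terminal_states
        self.state = initial_state

    def reset(self) -> None:
        self.state = self.initial_state

    def step(self, obs: Observation) -> float:
        # Fire the first outgoing transition whose guard holds for obs.
        for t in self.transitions.get(self.state, []):
            if t.guard(obs):
                self.state = t.target
                return t.reward
        return 0.0  # implicit self-loop with zero reward


# Hypothetical regions; the coordinates are illustrative only.
def at_coffee(obs: Observation) -> bool:
    x, y = obs
    return 2.0 <= x <= 3.0 and 4.0 <= y <= 5.0


def at_office(obs: Observation) -> bool:
    x, y = obs
    return 7.0 <= x <= 8.0 and 0.0 <= y <= 1.0


def run(srm: SymbolicRewardMachine, trajectory: List[Observation]) -> List[float]:
    """Reset the machine and return the reward emitted at each step."""
    srm.reset()
    return [srm.step(obs) for obs in trajectory]


# Task: reach the coffee region first, then the office region.
srm = SymbolicRewardMachine(
    initial_state=0,
    transitions={
        0: [Transition(at_coffee, target=1, reward=0.0)],
        1: [Transition(at_office, target=2, reward=1.0)],
    },
    terminal_states={2},
)

in_order = run(srm, [(0.0, 0.0), (2.5, 4.5), (7.5, 0.5)])
final_state_in_order = srm.state
out_of_order = run(srm, [(7.5, 0.5), (2.5, 4.5)])
final_state_out_of_order = srm.state
```

In this sketch the same observation `(7.5, 0.5)` yields a reward only after the coffee region has been visited, which is the non-Markovian behaviour the machine state is meant to capture. No labeling function is involved: the guards read the observation itself.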

Paper Structure (25 sections, 5 theorems, 13 equations, 10 figures, 2 algorithms)

Introduction
Preliminaries
Symbolic Reward Machines
Learning effective policies with given SRMs
Learning Symbolic Reward Machines
LSRM with Given Formulas
Variables
Constraints
Generate SRM from Model
LSRM with Formula Templates
Variables
Constraints
Generate SRM from Model
LSRM Convergence
Convergence of LSRM with Given Formulas
...and 10 more sections

Key Result

theorem 1

QSRM always converges to an optimal policy in the limit under the same conditions as Q-Learning [Watkins, 1992; Sutton, 2020]. So, if

Figures (10)

Figure 1: Office World environment. The discrete version is displayed. The labeled environments output the labels shown at the specific positions.
Figure 2: RM and SRM for the example task in the Office World. Self-loops with an output of zero are omitted in the RM.
Figure 3: RM and SRM of task diagonal_run for the Office World. Self-loops in the RM with zero rewards are omitted.
Figure 4: Original Mountain Car environment (left) and our version (right).
Figure 5: SRM for task 'rml' for our Mountain Car environment.
...and 5 more figures

Theorems & Definitions (12)

definition 1: MDP
definition 2: SRM
theorem 1: QSRM Convergence
proof
theorem 2: LSRM-GF convergence to equivalent SRM
proof
corollary 1: LSRM-GF convergence to optimal policy
proof
theorem 3: LSRM-FT convergence to equivalent SRM
proof
...and 2 more
