Reinforcement Learning with Symbolic Reward Machines

Thomas Krug, Daniel Neider

TL;DR

Symbolic Reward Machines (SRMs), together with the learning algorithms QSRM and LSRM, are proposed to overcome the limitations of Reward Machines (RMs). In the evaluation, the SRM methods outperform the baseline RL approaches and match the results of the existing RM methods.

Abstract

Reward Machines (RMs) are an established mechanism in Reinforcement Learning (RL) to represent and learn sparse, temporally extended tasks with non-Markovian rewards. RMs rely on high-level information in the form of labels that are emitted by the environment alongside the observation. However, this concept requires manual user input for each environment and task. The user has to create a suitable labeling function that computes the labels. These limitations lead to poor applicability in widely adopted RL frameworks. We propose Symbolic Reward Machines (SRMs) together with the learning algorithms QSRM and LSRM to overcome the limitations of RMs. SRMs consume only the standard output of the environment and process the observation directly through guards that are represented by symbolic formulas. In our evaluation, our SRM methods outperform the baseline RL approaches and generate the same results as the existing RM methods. At the same time, our methods adhere to the widely used environment definition and provide interpretable representations of the task to the user.
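
To make the idea of guards over raw observations concrete, the following is a minimal sketch, not the authors' definition or implementation. It assumes a finite set of machine states and treats each guard as a Boolean predicate on the observation vector; Python lambdas stand in for symbolic formulas. The names `Transition` and `SymbolicRewardMachine`, the thresholds, and the two-step task are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Observation = Sequence[float]
Guard = Callable[[Observation], bool]  # stand-in for a symbolic formula


@dataclass
class Transition:
    guard: Guard   # predicate evaluated directly on the raw observation
    target: int    # machine state entered when the guard holds
    reward: float  # reward emitted when the transition fires


class SymbolicRewardMachine:
    """Hypothetical sketch: a finite-state machine whose edges are guarded
    by predicates over observations instead of environment-emitted labels."""

    def __init__(self, transitions: Dict[int, List[Transition]], initial_state: int = 0):
        self.transitions = transitions
        self.initial_state = initial_state
        self.state = initial_state

    def reset(self) -> None:
        self.state = self.initial_state

    def step(self, obs: Observation) -> float:
        # Fire the first outgoing transition whose guard is satisfied.
        for t in self.transitions.get(self.state, []):
            if t.guard(obs):
                self.state = t.target
                return t.reward
        # No guard holds: implicit self-loop with zero reward.
        return 0.0


# Illustrative task (assumed): first reach x >= 8, then return to the corner
# x <= 1 and y <= 1. Only the second step is rewarded.
srm = SymbolicRewardMachine({
    0: [Transition(lambda o: o[0] >= 8.0, target=1, reward=0.0)],
    1: [Transition(lambda o: o[0] <= 1.0 and o[1] <= 1.0, target=2, reward=1.0)],
})

trajectory = [(0.5, 0.5), (8.5, 3.0), (4.0, 2.0), (0.5, 0.5)]
rewards = [srm.step(o) for o in trajectory]
```

In this sketch the observation `(0.5, 0.5)` yields no reward on the first step but a reward of 1 on the last, because the machine state records that the first subgoal has been reached. This is the non-Markovian behaviour the abstract refers to, obtained here without any labeling function.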

Paper Structure (25 sections, 5 theorems, 13 equations, 10 figures, 2 algorithms)


Key Result

theorem 1

QSRM always converges to an optimal policy in the limit under the same conditions as Q-Learning (Watkins, 1992; Sutton, 2020). So, if ...

Figures (10)

  • Figure 1: Office World environment. The discrete version is displayed. The labeled environments output the labels shown at the specific positions.
  • Figure 2: RM and SRM for the example task in the Office World. Self-loops with an output of zero are omitted in the RM.
  • Figure 3: RM and SRM of task diagonal_run for the Office World. Self-loops in the RM with zero rewards are omitted.
  • Figure 4: Original Mountain Car environment (left) and our version (right).
  • Figure 5: SRM for task 'rml' for our Mountain Car environment.
  • ...and 5 more figures

Theorems & Definitions (12)

  • definition 1: MDP
  • definition 2: SRM
  • theorem 1: QSRM Convergence
  • proof
  • theorem 2: LSRM-GF convergence to equivalent SRM
  • proof
  • corollary 1: LSRM-GF convergence to optimal policy
  • proof
  • theorem 3: LSRM-FT convergence to equivalent SRM
  • proof
  • ...and 2 more