Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines

Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp

TL;DR

This work tackles the challenge of reinforcement learning for long-horizon tasks with unordered subtasks by introducing three generalisations of Reward Machines: Numeric RM for compact task descriptions, Agenda RM that tracks remaining subtasks, and Coupled RM that enables parallel low-level learning. The authors propose Q-learning with Coupled RMs (CoRM), a compositional algorithm that learns per-subtask policies while choosing the next subtask based on observed completion efficiency, preserving global optimality via a tailored reward structure. Empirical results across Delivery, Office, and Water domains show CoRM scaling better and converging faster than state-of-the-art RM approaches, with reduced memory demands and improved robustness to completed subtasks. The method offers a practical path to solving complex unordered non-Markovian tasks and suggests future work in stochastic settings and integration with deep RL techniques.

Abstract

Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment. This is particularly advantageous for complex non-Markovian tasks because agents with access to RMs can learn more efficiently from fewer samples. However, learning with RMs is ill-suited for long-horizon problems in which a set of subtasks can be executed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. In this work, we address this limitation by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In Agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. Furthermore, we introduce a new compositional learning algorithm that leverages coupled RMs: Q-learning with coupled RMs (CoRM). Our experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
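To make the exponential growth concrete, the following minimal sketch counts how many RM states are needed to remember progress on $n$ unordered subtasks. It is an illustration rather than the paper's construction: it assumes a Boolean RM keeps one state per subset of completed subtasks, and that a single counter suffices when the subtasks are interchangeable, as with the boxes in the Delivery domain. The function names are hypothetical.

```python
from itertools import combinations


def boolean_rm_states(n_subtasks: int) -> int:
    """Count states of an RM that tracks *which* subtasks are done.

    Assumption for illustration: one state per subset of completed subtasks.
    """
    subtasks = range(n_subtasks)
    return sum(
        1
        for k in range(n_subtasks + 1)
        for _ in combinations(subtasks, k)  # every subset of size k
    )


def counter_states(n_subtasks: int) -> int:
    """Count states when only the *number* of remaining subtasks matters."""
    return n_subtasks + 1  # remaining count ranges over 0..n


for n in (2, 4, 8):
    print(n, boolean_rm_states(n), counter_states(n))
```

Under these assumptions the subset-tracking count doubles with every added subtask, while the counter grows by one.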

Paper Structure

This paper contains 20 sections, 2 theorems, 10 equations, and 3 figures.

Key Result

Theorem 1

Domain $\mathcal{D}_{p_w}$ fully captures the possible changes between the current and previous values of $w$, that is, $w_t$ and $w_{t-1}$.

Figures (3)

  • Figure 1: (a) Example Delivery instance with agent A, station s, and two boxes b$_1$ and b$_2$. The thick black lines depict walls. (b) Numeric reward machine (RM) for the Delivery domain. A discrete numeric variable counts the number of uncollected boxes and is mapped to numeric feature $b$. Here, $b{\downarrow}$ is true when a box is collected but some boxes remain uncollected, $b{\downarrow}\raisebox{-0.8ex}{--}$ is true when all boxes are collected, and $b!$ is true when not all boxes are collected. Boolean feature $\texttt{s}$ is true when the agent arrives at the station.
  • Figure 2: RMs for the Delivery domain with two boxes shown in Figure 1. Features b$_1$ and b$_2$ become true when the agent picks up box $1$ and box $2$, respectively. Symmetric states are shown in yellow and blue.
  • Figure 3: Results of CoRM (ours) and $Q$-learning with and without counterfactual reasoning (CRM and QRM, respectively) with Boolean RMs (B-CRM and B-QRM) and our agenda RMs ($\mathcal{T}$-CRM and $\mathcal{T}$-QRM) for the Delivery (a,b), Office (c,d), and Water (e,f) domains. CoRM-0 stands for CoRM without the joint optimisation from Eq. \ref{eq:r_K}. (g--i) Time (in seconds) to run $10^6$ steps in relation to the number of objectives. (j) Office environment used for experiments in (c,d) with agent A, offices o$_i, i=1, 2, \dots, 6$, decorations marked by stars, and coffee machines.
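The features named in the caption of Figure 1(b) can be evaluated from the previous and current values of the box counter. The sketch below is a hedged illustration, not the authors' implementation. The function name and the dictionary keys are hypothetical, and it assumes that the "all boxes are collected" feature fires on the step in which the counter drops to zero (the caption does not say whether it is an event or a persistent condition).

```python
from typing import Dict


def delivery_features(b_prev: int, b_curr: int, at_station: bool) -> Dict[str, bool]:
    """Evaluate the Figure 1(b) features for one environment step.

    b_prev, b_curr: number of uncollected boxes before and after the step.
    Key names are hypothetical stand-ins for the caption's symbols.
    """
    decreased = b_curr < b_prev
    return {
        # a box was collected but some boxes remain uncollected
        "b_dec": decreased and b_curr > 0,
        # assumption: fires on the step in which the last box is collected
        "b_dec_zero": decreased and b_curr == 0,
        # not all boxes are collected
        "b_nonzero": b_curr > 0,
        # the agent is at the station
        "s": at_station,
    }


# Hypothetical trace: two boxes, collect both, then reach the station.
trace = [(2, 1, False), (1, 0, False), (0, 0, True)]
for b_prev, b_curr, at_station in trace:
    features = delivery_features(b_prev, b_curr, at_station)
    print(b_prev, b_curr, [name for name, value in features.items() if value])
```

Because every feature depends only on the counter before and after the step, the number of boxes does not change the set of features, which is one way to read the compactness claimed for numeric RMs.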

Theorems & Definitions (8)

  • Definition 1: Boolean reward machine
  • Theorem 1
  • Proof of Theorem 1
  • Definition 2: Numeric reward machine
  • Definition 3: Agenda reward machine
  • Theorem 2
  • Proof of Theorem 2
  • Definition 4: Coupled reward machine