Detecting Hidden Triggers: Mapping Non-Markov Reward Functions to Markov

Gregory Hyde, Eugene Santos

TL;DR

This work addresses non-Markov rewards by learning a minimal Reward Machine (RM) representation without access to high-level symbols. It introduces Abstract Reward MDPs (ARMDPs), the cross-product of observed states with RM states, and an ILP-based procedure to map non-Markov reward data into a Markov framework, resolving reward conflicts via hidden triggers. A theoretical result shows that ARMDPs preserve reward expectations, and an active learning extension (ARMDPQ-Learning) integrates Q-learning to iteratively refine the RM while expanding the observed data. Empirically, the method recovers interpretable RM structures in Officeworld and Breakfastworld, improves representation efficiency for DQN variants via derived ARMDP state spaces, and demonstrates advantages of modeling rewards over histories for interdependent reward signals. Overall, this approach provides a principled, interpretable, and scalable pathway to handle non-Markov rewards in RL without prespecified symbolic labels.
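The idea of a reward conflict, and how pairing observed states with RM states removes it, can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the four-cell corridor, the `black_box_reward` function, and the hand-written two-state machine `rm_step`. The paper learns the machine with an ILP. Actions are omitted for brevity, so transitions are keyed on `(s, s')` only.

```python
from collections import defaultdict

# Hypothetical 1-D corridor with cells 0..3: coffee at cell 0, office at cell 3.
COFFEE, OFFICE = 0, 3

def black_box_reward(history, next_state):
    """Non-Markov reward: 1 on entering the office if coffee was visited earlier."""
    return 1 if next_state == OFFICE and COFFEE in history else 0

def rollout(states):
    """Turn a state sequence into (s, s', r) transitions using the black-box reward."""
    return [(s, s2, black_box_reward(states[: i + 1], s2))
            for i, (s, s2) in enumerate(zip(states, states[1:]))]

def conflicts(transitions, key):
    """Return the keys that are observed with more than one distinct reward."""
    seen = defaultdict(set)
    for t in transitions:
        seen[key(t)].add(t[-1])  # the reward is the last field of every tuple
    return {k for k, rewards in seen.items() if len(rewards) > 1}

def rm_step(u, next_state):
    """Hand-written two-state machine (an assumption): u = 1 once coffee is entered."""
    return 1 if (u == 1 or next_state == COFFEE) else 0

def lift(states):
    """Annotate each transition with the machine state, giving (s, u, s', r)."""
    u = 1 if states[0] == COFFEE else 0
    lifted = []
    for s, s2, r in rollout(states):
        lifted.append((s, u, s2, r))
        u = rm_step(u, s2)
    return lifted

trajectories = [[1, 2, 3], [1, 0, 1, 2, 3]]
flat = [t for states in trajectories for t in rollout(states)]
lifted = [t for states in trajectories for t in lift(states)]

flat_conflicts = conflicts(flat, key=lambda t: (t[0], t[1]))
lifted_conflicts = conflicts(lifted, key=lambda t: t[:3])
print(flat_conflicts, lifted_conflicts)
```

In this toy setting the transition `(2, 3)` is seen with rewards 0 and 1 when keyed on observed states alone. Once the machine state is part of the key the two cases separate, so the reward becomes a function of the cross-product state.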

Abstract

Many Reinforcement Learning algorithms assume a Markov reward function to guarantee optimality. However, not all reward functions are Markov. This paper proposes a framework for mapping non-Markov reward functions into equivalent Markov ones by learning specialized reward automata, Reward Machines. Unlike the general practice of learning Reward Machines, we do not require a set of high-level propositional symbols from which to learn. Rather, we learn hidden triggers, directly from data, that construct them. We demonstrate the importance of learning Reward Machines over their Deterministic Finite-State Automata counterparts given their ability to model reward dependencies. We formalize this distinction in our learning objective. Our mapping process is constructed as an Integer Linear Programming problem. We prove that our mappings form a suitable proxy for maximizing reward expectations. We empirically validate our approach by learning black-box, non-Markov reward functions in the Officeworld domain. Additionally, we demonstrate the effectiveness of learning reward dependencies in a new domain, Breakfastworld.

Paper Structure

This paper contains 23 sections, 2 theorems, 23 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

$RS(\tau_m) = ARS(\tau_m) \quad \forall \tau_m \in T_{o}$.
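A minimal numerical check of this kind of statement is sketched below. It rests on assumptions that this summary does not confirm: $RS$ is read as the undiscounted sum of observed rewards along a trajectory, $ARS$ as the sum of rewards produced by replaying the same trajectory through the abstract model, and $T_o$ as the set of observed trajectories. The corridor data and the functions `delta_u` and `delta_r` are hypothetical stand-ins for a learned machine.

```python
# Hypothetical observed trajectories T_o: lists of (s, s_next, r) on a corridor
# where entering cell 3 pays 1 only after cell 0 has been entered.
T_o = [
    [(1, 2, 0), (2, 3, 0)],
    [(1, 0, 0), (0, 1, 0), (1, 2, 0), (2, 3, 1)],
    [(2, 1, 0), (1, 0, 0), (0, 1, 0), (1, 2, 0), (2, 3, 1), (3, 2, 0), (2, 3, 1)],
]

def delta_u(u, s_next):
    """Assumed machine-state transition: switch to u = 1 once cell 0 is entered."""
    return 1 if (u == 1 or s_next == 0) else 0

def delta_r(u, s, s_next):
    """Assumed Markov reward on the cross-product state: pay 1 in u = 1 on entering cell 3."""
    return 1 if (u == 1 and s_next == 3) else 0

def reward_sum(traj):
    """RS under the stated assumption: sum of the observed rewards."""
    return sum(r for _, _, r in traj)

def abstract_reward_sum(traj, u0=0):
    """ARS under the stated assumption: replay the trajectory through the machine."""
    u, total = u0, 0
    for s, s_next, _ in traj:
        total += delta_r(u, s, s_next)
        u = delta_u(u, s_next)
    return total

matches = [reward_sum(t) == abstract_reward_sum(t) for t in T_o]
print(matches)
```

A check like this only confirms agreement on the trajectories supplied. It says nothing about trajectories outside the observed set.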

Figures (7)

  • Figure 1: The Officeworld domain (a) with four Reward Machines: (b) deliver coffee to office, (c) deliver mail to office, (d) deliver coffee and mail to office, and (e) patrol task sequencing A, B, C and D.
  • Figure 2: Average rewards with standard deviations, evaluated over 10 trials using ARMDPQ-Learning on Tasks (b-e) in the Officeworld domain.
  • Figure 3: Average rewards with standard deviations, evaluated over 10 trials using DQN models on Tasks (b-e) in the Officeworld domain.
  • Figure 4: The Breakfastworld domain (a) with two RMs: (b) cook, eat, and then leave; (c) cook, eat, wash, and then leave.
  • Figure 5: Patrol task where (a) patrols only once and (b) patrols continuously.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Definition 1
  • Theorem 1
  • Theorem 1
  • proof