Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives

Milad Kazemi, Mateo Perez, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Alvaro Velasquez

TL;DR

The paper tackles learning policies for continuing tasks subject to omega-regular specifications by translating absolute liveness properties into average-reward objectives in a model-free setting. It introduces reward machines and a synchronized product construction that preserves the necessary communication properties, enabling convergent learning with model-free RL algorithms. A lexicographic extension is developed to optimize a secondary mean-payoff objective among policies that satisfy the qualitative specification, with convergence guarantees under weakly or strictly communicating MDPs. Empirical results on communicating MDP benchmarks show the approach outperforms discount-based methods and works without episodic resets, highlighting the practicality of average-reward RL for long-running, specification-driven tasks.

Abstract

Recent advances in reinforcement learning (RL) have renewed focus on the design of reward functions that shape agent behavior. Manually designing reward functions is tedious and error-prone. A principled alternative is to specify behaviors in a formal language that can be automatically translated into rewards. Omega-regular languages are a natural choice for this purpose, given their established role in formal verification and synthesis. However, existing methods using omega-regular specifications typically rely on discounted reward RL in episodic settings, with periodic resets. This setup misaligns with the semantics of omega-regular specifications, which describe properties over infinite behavior traces. In such cases, the average reward criterion and the continuing setting -- where the agent interacts with the environment over a single, uninterrupted lifetime -- are more appropriate. To address the challenges of infinite-horizon, continuing tasks, we focus on absolute liveness specifications -- a subclass of omega-regular languages that cannot be violated by any finite behavior prefix, making them well-suited to the continuing setting. We present the first model-free RL framework that translates absolute liveness specifications to average-reward objectives. Our approach enables learning in communicating MDPs without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization, aiming to maximize an external average-reward objective among the policies that also maximize the satisfaction probability of a given omega-regular specification. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full knowledge of the environment, thus enabling model-free RL. Empirical results show our average-reward approach in the continuing setting outperforms discount-based methods across benchmarks.
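The contrast between the two criteria can be illustrated with a small sketch. It is not the paper's reward construction: the reward streams, the function names `average_reward` and `discounted_return`, and the discount factor are all assumptions chosen for illustration. Each stream is ultimately periodic, a finite `prefix` followed by a `cycle` repeated forever. The long-run average depends only on the cycle, whereas the discounted return also depends on the finite prefix.

```python
from fractions import Fraction


def average_reward(prefix, cycle):
    """Long-run average of the reward stream prefix . cycle^omega.

    The finite prefix contributes nothing in the limit, so only the
    cycle matters.
    """
    return Fraction(sum(cycle), len(cycle))


def discounted_return(prefix, cycle, gamma):
    """Closed-form discounted sum of the reward stream prefix . cycle^omega."""
    head = sum(r * gamma**t for t, r in enumerate(prefix))
    one_cycle = sum(r * gamma**t for t, r in enumerate(cycle))
    # Geometric series over the infinitely many repetitions of the cycle.
    tail = one_cycle / (1 - gamma**len(cycle))
    return head + gamma**len(prefix) * tail


gamma = Fraction(9, 10)      # illustrative discount factor
cycle = [1]                  # reward 1 on every step of the recurring behavior
late_start = [0, 0, 0]       # three unrewarded steps before the cycle begins

print(average_reward([], cycle), average_reward(late_start, cycle))
print(discounted_return([], cycle, gamma),
      discounted_return(late_start, cycle, gamma))
```

Both streams have the same average reward, while the discounted return penalizes the one that starts late. This mirrors, informally, why a prefix-independent criterion pairs naturally with specifications that no finite prefix can violate.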

Paper Structure

This paper contains 14 sections, 11 theorems, 17 equations, 6 figures, 2 tables.

Key Result

Lemma 3.4

Let $\mathcal{A}$ be a deterministic $\omega$-automaton with initial state $s_0$. $\mathcal{A}$ accepts an absolute liveness specification if, and only if, the language accepted from any reachable state $s$ of $\mathcal{A}$ contains the language accepted from $s_0$.
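The containment condition in the lemma can be probed on small examples. The sketch below is a bounded sanity check, not a decision procedure and not taken from the paper. It assumes two hypothetical deterministic co-Büchi automata over the letters `a` and `n` (for "not a"), one for $\mathsf{F}\mathsf{G}\, a$ and one for $\mathsf{G}\, a$, and tests the containment only on ultimately periodic words with prefix and cycle length up to a small bound. All names (`FG_A`, `G_A`, `accepts_lasso`, `containment_holds_on_lassos`) are illustrative.

```python
from itertools import product

ALPHABET = ("a", "n")  # "n" stands for "not a"

# Hypothetical deterministic co-Buchi automata: a run is accepting iff it
# visits the rejecting states only finitely often.
FG_A = {"delta": {("q0", "a"): "q0", ("q0", "n"): "q1",
                  ("q1", "a"): "q0", ("q1", "n"): "q1"},
        "rejecting": {"q1"}, "initial": "q0"}
G_A = {"delta": {("q0", "a"): "q0", ("q0", "n"): "q1",
                 ("q1", "a"): "q1", ("q1", "n"): "q1"},
       "rejecting": {"q1"}, "initial": "q0"}


def accepts_lasso(aut, state, prefix, cycle):
    """Decide acceptance of prefix . cycle^omega when started from `state`."""
    delta = aut["delta"]
    for letter in prefix:
        state = delta[(state, letter)]
    seen, rounds = {}, []
    # Repeat the cycle until the state at a cycle boundary recurs.
    while state not in seen:
        seen[state] = len(rounds)
        visited = set()
        for letter in cycle:
            state = delta[(state, letter)]
            visited.add(state)
        rounds.append(visited)
    # States visited infinitely often: those seen from the first recurrence on.
    recurring = set().union(*rounds[seen[state]:])
    return not (recurring & aut["rejecting"])


def reachable(aut):
    """All states reachable from the initial state."""
    frontier, found = [aut["initial"]], {aut["initial"]}
    while frontier:
        s = frontier.pop()
        for letter in ALPHABET:
            t = aut["delta"][(s, letter)]
            if t not in found:
                found.add(t)
                frontier.append(t)
    return found


def containment_holds_on_lassos(aut, max_len=3):
    """Check L(s0) subset of L(s) for every reachable s, on bounded lassos only."""
    states = reachable(aut)
    for p_len in range(max_len + 1):
        for c_len in range(1, max_len + 1):
            for prefix in product(ALPHABET, repeat=p_len):
                for cycle in product(ALPHABET, repeat=c_len):
                    if accepts_lasso(aut, aut["initial"], prefix, cycle) and \
                            not all(accepts_lasso(aut, s, prefix, cycle)
                                    for s in states):
                        return False
    return True


print(containment_holds_on_lassos(FG_A))
print(containment_holds_on_lassos(G_A))
```

Under these assumptions the check passes for the $\mathsf{F}\mathsf{G}\, a$ automaton and fails for the $\mathsf{G}\, a$ automaton, where the word $a^\omega$ is accepted from the initial state but rejected from the reachable sink. A passing bounded check is only evidence, since longer lassos are not examined.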

Figures (6)

  • Figure 1: Automaton for $(\mathsf{F}\mathsf{G}\, a) \vee (\mathsf{F}\mathsf{G}\, \neg a)$.
  • Figure 2: Automaton for $\mathsf{F}\mathsf{G}\, a$; dashed lines represent resets.
  • Figure 3: An MDP in which each transition represents an action.
  • Figure 4: Example showing that the lexicographically optimal policy may require infinite memory. Actions are denoted as arrows; the reward is $0$ unless otherwise indicated. The specification of interest is $\mathsf{G}\mathsf{F}\, a$.
  • Figure 5: Picture of the construction of the probabilistic reward machine in \ref{eq:prm}. Left: Two layers corresponding to $b\in\{0,1\}$. Right: Probabilistic changes of the additional bit $b$. The $\epsilon$-transitions are excluded for a better pictorial presentation.
  • ...and 1 more figures

Theorems & Definitions (22)

  • Definition 2.1: Markov decision process (MDP)
  • Definition 2.2: Probabilistic Reward Machines
  • Definition 2.3: Product (rewardful) MDP
  • Definition 2.4: Büchi Automata
  • Definition 2.5: Product MDP
  • Definition 2.6: Good-for-MDP [Hahn20]
  • Definition 3.1: Safety and Liveness
  • Definition 3.2: Absolute Liveness
  • Definition 3.3: Stable Specification
  • Lemma 3.4
  • ...and 12 more