Table of Contents
Fetching ...

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

Mateo Perez, Fabio Somenzi, Ashutosh Trivedi

TL;DR

This work addresses learning policies in MDPs that satisfy LTL and ω-regular (ω-regular) objectives, which capture non-Markovian goals in reinforcement learning. It introduces a model-based PAC framework, $\omega$-PAC, that leverages a new speed metric, the $\varepsilon$-recurrence time $T$, to obtain polynomial sample complexity in $|S|$, $|A|$, $T$, $1/\varepsilon$, and $1/\delta$. The algorithm maintains an optimistic learned MDP, labels edges as known or unknown using a threshold $k$, and uses an accepting sink for unknown transitions to ensure exploration; it returns a $6\varepsilon$-optimal policy with probability at least $1-\delta$. The approach naturally extends to on-the-fly product MDPs with ω-automata, enabling practical PAC guarantees for LTL/ω-regular objectives, as validated by gridworld and chain experiments. Overall, the work advances model-based RL for complex temporal objectives by formalizing convergence speed and proving polynomial sample efficiency.

Abstract

Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

TL;DR

This work addresses learning policies in MDPs that satisfy LTL and ω-regular (ω-regular) objectives, which capture non-Markovian goals in reinforcement learning. It introduces a model-based PAC framework, -PAC, that leverages a new speed metric, the -recurrence time , to obtain polynomial sample complexity in , , , , and . The algorithm maintains an optimistic learned MDP, labels edges as known or unknown using a threshold , and uses an accepting sink for unknown transitions to ensure exploration; it returns a -optimal policy with probability at least . The approach naturally extends to on-the-fly product MDPs with ω-automata, enabling practical PAC guarantees for LTL/ω-regular objectives, as validated by gridworld and chain experiments. Overall, the work advances model-based RL for complex temporal objectives by formalizing convergence speed and proving polynomial sample efficiency.

Abstract

Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.
Paper Structure (11 sections, 7 theorems, 11 equations, 3 figures, 2 algorithms)

This paper contains 11 sections, 7 theorems, 11 equations, 3 figures, 2 algorithms.

Key Result

Lemma 1

Let $M = (S, P, s_0)$ be a Markov chain and $p_{\min} = \min_{\{s,s' | P(s,s') > 0\}} P(s,s')$ be the minimum positive transition probability in $M$. Then the $\varepsilon$-recurrence time $T$ satisfies $T \le 2|S| \frac{\log(\varepsilon/2)}{\log(1-p_{\min}^{|S|})}$ .

Figures (3)

  • Figure 1: Example adopted from alur2022framework. The objective is to remain in $s_0$ forever.
  • Figure 2: The gridworld example. The red cell is a trap. The goal is to visit the two states $s$ and $g$ repeatedly.
  • Figure 3: The distribution of the probability of satisfaction of learned policies for different values of $k$ for the chain example. We also show the optimal probability of satisfaction $p$ and the threshold for $6\varepsilon$-optimality.

Theorems & Definitions (18)

  • Example 1: Intractability of LTL
  • Definition 1
  • Lemma 1
  • proof
  • Definition 2
  • Definition 3
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 8 more