A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

Mateo Perez; Fabio Somenzi; Ashutosh Trivedi

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

Mateo Perez, Fabio Somenzi, Ashutosh Trivedi

TL;DR

This work addresses learning policies in MDPs that satisfy LTL and ω-regular (ω-regular) objectives, which capture non-Markovian goals in reinforcement learning. It introduces a model-based PAC framework, $\omega$-PAC, that leverages a new speed metric, the $\varepsilon$-recurrence time $T$, to obtain polynomial sample complexity in $|S|$, $|A|$, $T$, $1/\varepsilon$, and $1/\delta$. The algorithm maintains an optimistic learned MDP, labels edges as known or unknown using a threshold $k$, and uses an accepting sink for unknown transitions to ensure exploration; it returns a $6\varepsilon$-optimal policy with probability at least $1-\delta$. The approach naturally extends to on-the-fly product MDPs with ω-automata, enabling practical PAC guarantees for LTL/ω-regular objectives, as validated by gridworld and chain experiments. Overall, the work advances model-based RL for complex temporal objectives by formalizing convergence speed and proving polynomial sample efficiency.

Abstract

Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

TL;DR

-PAC, that leverages a new speed metric, the

-recurrence time

, to obtain polynomial sample complexity in

, and

. The algorithm maintains an optimistic learned MDP, labels edges as known or unknown using a threshold

, and uses an accepting sink for unknown transitions to ensure exploration; it returns a

-optimal policy with probability at least

. The approach naturally extends to on-the-fly product MDPs with ω-automata, enabling practical PAC guarantees for LTL/ω-regular objectives, as validated by gridworld and chain experiments. Overall, the work advances model-based RL for complex temporal objectives by formalizing convergence speed and proving polynomial sample efficiency.

Abstract

Paper Structure (11 sections, 7 theorems, 11 equations, 3 figures, 2 algorithms)

This paper contains 11 sections, 7 theorems, 11 equations, 3 figures, 2 algorithms.

Introduction
Contributions.
Related work.
Preliminaries
Main Results
The $\omega$-PAC Algorithm
Experiments
Gridworld example.
Chain example.
Conclusion
Detailed Algorithm

Key Result

Lemma 1

Let $M = (S, P, s_0)$ be a Markov chain and $p_{\min} = \min_{\{s,s' | P(s,s') > 0\}} P(s,s')$ be the minimum positive transition probability in $M$. Then the $\varepsilon$-recurrence time $T$ satisfies $T \le 2|S| \frac{\log(\varepsilon/2)}{\log(1-p_{\min}^{|S|})}$ .

Figures (3)

Figure 1: Example adopted from alur2022framework. The objective is to remain in $s_0$ forever.
Figure 2: The gridworld example. The red cell is a trap. The goal is to visit the two states $s$ and $g$ repeatedly.
Figure 3: The distribution of the probability of satisfaction of learned policies for different values of $k$ for the chain example. We also show the optimal probability of satisfaction $p$ and the threshold for $6\varepsilon$-optimality.

Theorems & Definitions (18)

Example 1: Intractability of LTL
Definition 1
Lemma 1
proof
Definition 2
Definition 3
Lemma 2
proof
Lemma 3
proof
...and 8 more

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

TL;DR

Abstract

A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (3)

Theorems & Definitions (18)