A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs
Mateo Perez, Fabio Somenzi, Ashutosh Trivedi
TL;DR
This work addresses learning policies in MDPs that satisfy LTL and ω-regular (ω-regular) objectives, which capture non-Markovian goals in reinforcement learning. It introduces a model-based PAC framework, $\omega$-PAC, that leverages a new speed metric, the $\varepsilon$-recurrence time $T$, to obtain polynomial sample complexity in $|S|$, $|A|$, $T$, $1/\varepsilon$, and $1/\delta$. The algorithm maintains an optimistic learned MDP, labels edges as known or unknown using a threshold $k$, and uses an accepting sink for unknown transitions to ensure exploration; it returns a $6\varepsilon$-optimal policy with probability at least $1-\delta$. The approach naturally extends to on-the-fly product MDPs with ω-automata, enabling practical PAC guarantees for LTL/ω-regular objectives, as validated by gridworld and chain experiments. Overall, the work advances model-based RL for complex temporal objectives by formalizing convergence speed and proving polynomial sample efficiency.
Abstract
Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the epsilon-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the omega-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.
