Table of Contents
Fetching ...

Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning

Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, Sanjit A. Seshia

TL;DR

The paper tackles learning temporally extended tasks specified by deterministic automata in reinforcement learning by formalizing a DFA-conditioned MDP and proving PAC-learnability under finite DFA spaces. It then introduces a method to learn provably correct automata embeddings via a bisimulation metric learned together with a policy, ensuring that distinct embeddings correspond to non-bisimilar tasks. The encoder-based approach preserves task equivalence under bisimulation and yields improved downstream policy learning, as validated by experiments using a GATv2 DFA encoder with RAD DFAs. Overall, the work provides both theoretical guarantees and practical methods for sample-efficient, multi-task RL with runtime-specified temporal objectives. This advances principled automata-guided RL by tying task representations to formal equivalence through bisimulation and embedding learning.

Abstract

Automata-conditioned reinforcement learning (RL) has given promising results for learning multi-task policies capable of performing temporally extended objectives given at runtime, done by pretraining and freezing automata embeddings prior to training the downstream policy. However, no theoretical guarantees were given. This work provides a theoretical framework for the automata-conditioned RL problem and shows that it is probably approximately correct learnable. We then present a technique for learning provably correct automata embeddings, guaranteeing optimal multi-task policy learning. Our experimental evaluation confirms these theoretical results.

Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning

TL;DR

The paper tackles learning temporally extended tasks specified by deterministic automata in reinforcement learning by formalizing a DFA-conditioned MDP and proving PAC-learnability under finite DFA spaces. It then introduces a method to learn provably correct automata embeddings via a bisimulation metric learned together with a policy, ensuring that distinct embeddings correspond to non-bisimilar tasks. The encoder-based approach preserves task equivalence under bisimulation and yields improved downstream policy learning, as validated by experiments using a GATv2 DFA encoder with RAD DFAs. Overall, the work provides both theoretical guarantees and practical methods for sample-efficient, multi-task RL with runtime-specified temporal objectives. This advances principled automata-guided RL by tying task representations to formal equivalence through bisimulation and embedding learning.

Abstract

Automata-conditioned reinforcement learning (RL) has given promising results for learning multi-task policies capable of performing temporally extended objectives given at runtime, done by pretraining and freezing automata embeddings prior to training the downstream policy. However, no theoretical guarantees were given. This work provides a theoretical framework for the automata-conditioned RL problem and shows that it is probably approximately correct learnable. We then present a technique for learning provably correct automata embeddings, guaranteeing optimal multi-task policy learning. Our experimental evaluation confirms these theoretical results.

Paper Structure

This paper contains 16 sections, 10 theorems, 34 equations, 1 figure.

Key Result

Theorem 1

If a learning algorithm $\mathbb{A}$ is PAC-MDP as defined in defn:pac, then for any DFA-conditioned MDP $\mathcal{M} \mid_L \mathcal{M}_{D_{\Sigma, n}}$, $\epsilon > 0$, and $p \in (0,1)$, there exists a polynomial function s.t. the total number of $\epsilon$-suboptimal steps taken by $\mathbb{A}$ is at most $N'$ with probability at least $1 - p$.

Figures (1)

  • Figure 1: Left: learning curves for the method in \ref{['sec:bisim']}. Center: heatmap of \ref{['eqn:metric']} between various DFAs. Right: learning DFA-conditioned policies for RA tasks with number of states sampled uniformly from $[3,6]$, comparing DFA embeddings from \ref{['sec:bisim']} and yalcinkaya24compositional.

Theorems & Definitions (27)

  • Definition 1: Markov Decision Process
  • Definition 2: Deterministic Finite Automaton
  • Definition 3: Bisimulation Relation over DFAs
  • Definition 4: DFA Space
  • Definition 5: DFA-Conditioned MDP
  • Definition 6: DFA-Conditioned Reinforcement Learning Problem
  • Definition 7: Probably Approximately Correct Learnability in MDPs
  • Theorem 1
  • Definition 8: Bisimulation Relation over MDP states
  • Lemma 1
  • ...and 17 more