Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning
Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, Sanjit A. Seshia
TL;DR
The paper tackles learning temporally extended tasks specified by deterministic automata in reinforcement learning by formalizing a DFA-conditioned MDP and proving PAC-learnability under finite DFA spaces. It then introduces a method to learn provably correct automata embeddings via a bisimulation metric learned together with a policy, ensuring that distinct embeddings correspond to non-bisimilar tasks. The encoder-based approach preserves task equivalence under bisimulation and yields improved downstream policy learning, as validated by experiments using a GATv2 DFA encoder with RAD DFAs. Overall, the work provides both theoretical guarantees and practical methods for sample-efficient, multi-task RL with runtime-specified temporal objectives. This advances principled automata-guided RL by tying task representations to formal equivalence through bisimulation and embedding learning.
Abstract
Automata-conditioned reinforcement learning (RL) has given promising results for learning multi-task policies capable of performing temporally extended objectives given at runtime, done by pretraining and freezing automata embeddings prior to training the downstream policy. However, no theoretical guarantees were given. This work provides a theoretical framework for the automata-conditioned RL problem and shows that it is probably approximately correct learnable. We then present a technique for learning provably correct automata embeddings, guaranteeing optimal multi-task policy learning. Our experimental evaluation confirms these theoretical results.
