ParMod: A Parallel and Modular Framework for Learning Non-Markovian Tasks
Ruixuan Miao, Xu Lu, Cong Tian, Bin Yu, Zhenhua Duan
TL;DR
ParMod tackles the learning of non-Markovian tasks by translating temporal logic specifications into DFAs and constructing a product MDP that encodes the required memory in the state space. It then partitions the DFA into task phases via Task Phase Classification and trains each phase in parallel with a dedicated network, using a DFA-informed reward shaping scheme to improve sample efficiency. The framework is theoretically grounded in NM-to-MDP equivalence and optimality results, and empirically demonstrates faster learning, higher success rates, and better policy quality than baselines on the Waterworld, Racecar, and HalfCheetah benchmarks. The approach handles temporally extended tasks in a scalable, parallelizable way, and the authors point to distributed implementations and dynamic phase classification as future work.
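The product construction described above can be illustrated with a minimal sketch. It assumes a toy one-dimensional environment and a hand-written DFA for the task "reach `a`, then reach `b`"; the names `DFA`, `ProductEnv`, and `labeler` are hypothetical and are not taken from the ParMod implementation.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    initial: str
    accepting: frozenset
    transitions: dict  # maps (dfa_state, label) -> next dfa_state

    def step(self, q, label):
        # Assumption of this sketch: a missing transition is a self-loop.
        return self.transitions.get((q, label), q)


class ProductEnv:
    """Toy line world whose observations are (position, dfa_state) pairs."""

    def __init__(self, dfa, labeler, length=5):
        self.dfa = dfa
        self.labeler = labeler  # maps a position to a label such as "a" or ""
        self.length = length

    def reset(self):
        self.pos = 0
        self.q = self.dfa.initial
        return (self.pos, self.q)

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clipped to the line.
        self.pos = max(0, min(self.length - 1, self.pos + action))
        # The DFA advances on the label of the new position, so the DFA state
        # summarizes the part of the history that the reward depends on.
        self.q = self.dfa.step(self.q, self.labeler(self.pos))
        done = self.q in self.dfa.accepting
        reward = 1.0 if done else 0.0
        return (self.pos, self.q), reward, done


# Task: visit "a" (cell 4) and afterwards "b" (cell 0).
dfa = DFA(
    initial="q0",
    accepting=frozenset({"q2"}),
    transitions={("q0", "a"): "q1", ("q1", "b"): "q2"},
)
env = ProductEnv(dfa, labeler=lambda pos: {4: "a", 0: "b"}.get(pos, ""))

state = env.reset()
trace = []
for action in [+1] * 4 + [-1] * 4:
    state, reward, done = env.step(action)
    trace.append((state, reward, done))
```

In this sketch the agent stands on cell 0 both at reset and at the end of the episode, but only the second visit is rewarded. The position alone cannot explain that difference, whereas the pair `(position, dfa_state)` can, which is the sense in which the product state makes the reward Markovian.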
Abstract
Markov Decision Processes (MDPs), the model most commonly used in Reinforcement Learning (RL), rest on the premise that rewards depend only on the current state and action. However, many real-world tasks are non-Markovian: they involve long-term memory and dependencies. The reward sparseness problem is further amplified in non-Markovian scenarios. Hence, learning a non-Markovian task (NMT) is inherently more difficult than learning a Markovian one. In this paper, we propose a novel Parallel and Modular RL framework, ParMod, specifically for learning NMTs specified by temporal logic. With the aid of formal techniques, the NMT is modularized into a series of sub-tasks based on the structure of the automaton equivalent to its temporal logic specification. On this basis, the sub-tasks are trained by a group of agents in parallel, with one agent handling one sub-task. Besides parallel training, the core of ParMod lies in a flexible classification method for modularizing the NMT and an effective reward shaping method for improving sample efficiency. A comprehensive evaluation is conducted on several challenging benchmark problems with respect to various metrics. The experimental results show that ParMod outperforms other relevant approaches. Our work thus provides a good synergy among RL, NMTs, and temporal logic.
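To make the two core ingredients more concrete, the following sketch shows one possible way to modularize a DFA and shape rewards from it. The abstract does not specify the classification criterion or the shaping function, so both choices here are assumptions: phases are formed by grouping DFA states according to their shortest distance to an accepting state, and shaping uses the standard potential-based form, with the negative distance as the potential. The function names are hypothetical.

```python
from collections import deque


def distance_to_accepting(transitions, accepting):
    """Breadth-first search over reversed DFA edges.

    Returns {state: minimum number of transitions needed to reach acceptance}.
    States that cannot reach an accepting state are omitted.
    """
    reverse = {}
    for (src, _label), dst in transitions.items():
        reverse.setdefault(dst, set()).add(src)
    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:
        q = queue.popleft()
        for p in reverse.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist


def classify_phases(dist):
    """Assumed criterion: one phase per distance value."""
    phases = {}
    for q, d in dist.items():
        phases.setdefault(d, []).append(q)
    return {d: sorted(qs) for d, qs in phases.items()}


def shaped_reward(reward, q, q_next, dist, gamma=0.99):
    """Potential-based shaping with potential Phi(q) = -dist[q]."""
    phi_q, phi_next = -float(dist[q]), -float(dist[q_next])
    return reward + gamma * phi_next - phi_q


# DFA for "reach a, then reach b": q0 --a--> q1 --b--> q2 (accepting).
transitions = {("q0", "a"): "q1", ("q1", "b"): "q2"}
dist = distance_to_accepting(transitions, accepting={"q2"})
phases = classify_phases(dist)

# With gamma = 1, moving one DFA state closer to acceptance earns a bonus of 1,
# and staying in the same DFA state earns no bonus.
progress_bonus = shaped_reward(0.0, "q0", "q1", dist, gamma=1.0)
idle_bonus = shaped_reward(0.0, "q0", "q0", dist, gamma=1.0)
```

Under these assumptions, each phase could be assigned its own agent and network, and the shaping term gives intermediate feedback whenever the DFA advances, which is one way to counteract the sparse reward that is only delivered on acceptance.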
