Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machine
Shayan Meshkat Alsadat, Jean-Raphael Gaglione, Daniel Neider, Ufuk Topcu, Zhe Xu
TL;DR
The paper tackles reinforcement learning with non-Markovian rewards by automatically injecting high-level domain knowledge from large language models into a reward-machine interface. It encodes this knowledge as deterministic finite automata (DFAs) and reward machines (RMs), and uses SAT-based automata learning to derive, from data traces, the minimal RM compatible with the LLM-derived DFAs. A closed-loop prompt refinement mechanism uses counterexamples to iteratively improve the DFA, ensuring convergence to the ground-truth RM and an optimal policy with formal guarantees; empirical case studies show up to a 30% speedup in convergence. The framework reduces reliance on expert-crafted automata and enables domain-specific RL acceleration through automated, adaptive prompting and automata synthesis, while mitigating LLM hallucinations via compatibility checks.
Abstract
We present LARL-RM (Large language model-generated Automaton for Reinforcement Learning with Reward Machine), an algorithm that encodes high-level knowledge as an automaton in order to expedite reinforcement learning. Our method uses prompt engineering to obtain high-level, domain-specific knowledge from large language models (LLMs), rather than supplying that knowledge to the reinforcement learning algorithm directly, which requires an expert to encode the automaton. We use chain-of-thought and few-shot prompting and demonstrate that our method works with both approaches. Additionally, LARL-RM allows fully closed-loop reinforcement learning with no expert needed to guide and supervise the learning, since it can query the LLM directly to generate the high-level knowledge required for the task at hand. We also provide a theoretical guarantee that our algorithm converges to an optimal policy. In two case studies, we demonstrate that LARL-RM speeds up convergence by 30%.
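
To make the objects named above concrete, the following is a minimal sketch in Python of a reward machine, a DFA, and a counterexample check. It is not the LARL-RM implementation. The class names, the toy "coffee then office" task, the self-loop treatment of missing transitions, and the specific notion of compatibility used here (a trace that ends with positive reward must be accepted by the DFA) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Hypothetical reward machine: reads labels and emits one reward per step."""
    initial: str
    delta: dict  # (state, label) -> (next_state, reward)

    def run(self, labels):
        state, rewards = self.initial, []
        for label in labels:
            # Assumption: a missing transition is a self-loop with zero reward.
            state, reward = self.delta.get((state, label), (state, 0.0))
            rewards.append(reward)
        return rewards


@dataclass
class DFA:
    """Hypothetical stand-in for an LLM-generated automaton."""
    initial: str
    accepting: set
    delta: dict  # (state, label) -> next_state

    def accepts(self, labels):
        state = self.initial
        for label in labels:
            # Assumption: a missing transition is a self-loop.
            state = self.delta.get((state, label), state)
        return state in self.accepting


def is_counterexample(dfa, labels, rewards):
    """Illustrative check: the trace earned a final positive reward,
    yet the DFA rejects its label sequence."""
    return bool(rewards) and rewards[-1] > 0 and not dfa.accepts(labels)


# Toy task: reward 1 only when "office" is reached after "coffee".
rm = RewardMachine(
    "u0",
    {("u0", "coffee"): ("u1", 0.0), ("u1", "office"): ("u2", 1.0)},
)

# A DFA that agrees with the task, and one that wrongly demands "mail" first.
good_dfa = DFA("q0", {"q2"}, {("q0", "coffee"): "q1", ("q1", "office"): "q2"})
bad_dfa = DFA("q0", {"q2"}, {("q0", "mail"): "q1", ("q1", "office"): "q2"})

trace = ["coffee", "office"]
rewards = rm.run(trace)
print(rewards)                                      # [0.0, 1.0]
print(rm.run(["office"]))                           # [0.0]
print(is_counterexample(good_dfa, trace, rewards))  # False
print(is_counterexample(bad_dfa, trace, rewards))   # True
```

The two calls to rm.run show why the reward is non-Markovian: the final label "office" earns a reward only when "coffee" came earlier in the history. In a closed-loop setting, a trace flagged by a check like is_counterexample is the kind of evidence that could be fed back into the prompt to revise the DFA.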
