Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machine

Shayan Meshkat Alsadat; Jean-Raphael Gaglione; Daniel Neider; Ufuk Topcu; Zhe Xu

TL;DR

The paper tackles reinforcement learning with non-Markovian rewards by automatically injecting high-level domain knowledge from large language models into a reward-machine interface. It encodes this knowledge as deterministic finite automata (DFA) and reward machines (RM), and uses SAT-based automata learning to derive the minimal RM compatible with LLM-derived DFAs from data traces. A closed-loop prompt refinement mechanism uses counterexamples to iteratively improve the DFA, ensuring convergence to the ground-truth RM and an optimal policy with formal guarantees; empirical case studies show up to a 30% speedup in convergence. This framework reduces reliance on expert-crafted automata and enables domain-specific RL acceleration through automated, adaptive prompting and automata synthesis, while mitigating LLM hallucinations via compatibility checks.
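
To make the notion of a reward machine concrete, here is a minimal Python sketch that uses the standard Mealy-style formulation: a finite set of states, and a transition table that maps a (state, label) pair to a next state and a reward. The class name `RewardMachine`, the zero-reward self-loop default for unlisted transitions, and the toy "reach J, then b" task are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# A label is the set of atomic propositions that hold at one step.
Label = FrozenSet[str]


@dataclass
class RewardMachine:
    """Hypothetical minimal reward machine (Mealy-style)."""
    initial: int
    # (state, label) -> (next_state, reward)
    delta: Dict[Tuple[int, Label], Tuple[int, float]]

    def run(self, labels: List[Label]) -> List[float]:
        """Return the reward emitted after each label in the sequence."""
        state, rewards = self.initial, []
        for label in labels:
            # Assumption: unlisted pairs self-loop with zero reward.
            state, reward = self.delta.get((state, label), (state, 0.0))
            rewards.append(reward)
        return rewards


# Toy task (assumed for illustration): visit J first, then b.
J, B, EMPTY = frozenset({"J"}), frozenset({"b"}), frozenset()
rm = RewardMachine(initial=0, delta={(0, J): (1, 0.0), (1, B): (2, 1.0)})

print(rm.run([B, J, EMPTY, B]))  # b is rewarded only after J was seen
print(rm.run([B]))               # the same label b earns nothing here
```

The two calls show why the reward is non-Markovian: the label b yields a reward only when the history already contains J, and the machine state is what records that history.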

Abstract

We present LARL-RM (Large language model-generated Automaton for Reinforcement Learning with Reward Machine), an algorithm that encodes high-level knowledge as an automaton in order to expedite reinforcement learning. Instead of relying on an expert to encode the automaton and hand it to the reinforcement learning algorithm, our method uses prompt engineering to obtain high-level, domain-specific knowledge from a Large Language Model (LLM). We use chain-of-thought and few-shot prompting and demonstrate that our method works with both. Additionally, LARL-RM allows fully closed-loop reinforcement learning with no expert guiding or supervising the learning, since it can query the LLM directly for the high-level knowledge the task at hand requires. We also provide a theoretical guarantee that the algorithm converges to an optimal policy. In two case studies, LARL-RM speeds up convergence by 30%.
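
The closed-loop idea can be illustrated with a small Python sketch that checks an LLM-generated DFA against observed traces and returns a counterexample that could be fed back into the prompt. It assumes a compatibility notion in which every label sequence ending in a positive reward must be accepted by the DFA; the paper's own definition may differ in detail. The names `DFA` and `find_counterexample`, the reject-on-missing-transition rule, and the toy automata are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Label = FrozenSet[str]
Trace = List[Tuple[Label, float]]  # (label, reward) pairs from one episode


class DFA:
    """Hypothetical minimal DFA over labels."""

    def __init__(self, initial: int, accepting: Set[int],
                 delta: Dict[Tuple[int, Label], int]) -> None:
        self.initial = initial
        self.accepting = accepting
        self.delta = delta

    def accepts(self, labels: List[Label]) -> bool:
        state = self.initial
        for label in labels:
            key = (state, label)
            if key not in self.delta:
                return False  # assumption: a missing transition rejects
            state = self.delta[key]
        return state in self.accepting


def find_counterexample(dfa: DFA, traces: List[Trace]) -> Optional[List[Label]]:
    """Return a label prefix that earned a positive reward but is rejected."""
    for trace in traces:
        for k in range(1, len(trace) + 1):
            labels = [label for label, _ in trace[:k]]
            last_reward = trace[k - 1][1]
            if last_reward > 0 and not dfa.accepts(labels):
                return labels
    return None


J, B, G = frozenset({"J"}), frozenset({"b"}), frozenset({"g"})
observed = [[(J, 0.0), (B, 1.0)]]  # the environment rewarded "J then b"

good = DFA(0, {2}, {(0, J): 1, (1, B): 2})
# A DFA that wrongly insists on seeing g before J (a stand-in for a hallucination).
bad = DFA(0, {3}, {(0, G): 1, (1, J): 2, (2, B): 3})

print(find_counterexample(good, observed))
print(find_counterexample(bad, observed))
```

In a closed loop, a non-`None` result would be the evidence used to refine the prompt and request a new DFA; how that refinement is phrased is outside the scope of this sketch.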

Paper Structure (18 sections, 13 theorems, 8 equations, 19 figures, 2 algorithms)

Introduction
Preliminaries
Generating Domain Specific Knowledge Using LLM
Expediting Reinforcement Learning Using LLM-generated DFA
Using LLM-generated DFA to Learn the Reward Machine
Refinement of Prompt and DFA
Convergence to Optimal Policy
Case Studies
Case Study 1
Case Study 2
Conclusion
Comparison of Closed-loop and Open-loop LARL-RM
Case Study 1
Case Study 2
Theoretical Guarantee of the LARL-RM
...and 3 more sections

Key Result

Lemma 1

Let $\mathcal{M}$ be a labeled MDP, $\mathcal{A}$ the ground truth reward machine encoding the rewards of $\mathcal{M}$, and $\mathscr{D}^\star = \{\mathcal{D}_{1}, \ldots, \mathcal{D}_{m} \}$ the set of all LLM-generated DFAs that are added to $\mathcal{D}$ during the run of LARL-RM. Additionally,

Figures (19)

Figure 1: An autonomous car (agent) must first go to intersection $J$ and then make a right turn to go to intersection $b$. Agent must check for the green traffic light $g$, car $c$, and pedestrian $p$.
Figure 2: DFA of the motivating example where the agent must reach intersection $J$ while avoiding cars and pedestrians.
Figure 3: An LLM-generated DFA for our motivating example.
Figure 4: Prompting GPT-3.5-Turbo to adopt a persona as an expert in traffic rules (GPT response and prompt).
Figure 5: Mapping the output of the LLM to a specific set of propositions.
...and 14 more figures

Theorems & Definitions (28)

Definition 1: labeled Markov decision process
Definition 2: Deterministic finite automaton
Definition 3: Reward machine
Definition 4: LLM-generated DFA compatibility
Lemma 1
Theorem 1
proof: Proof of Lemma \ref{lem:learn-correct-RM}
Theorem 2
Lemma 2
proof
...and 18 more
