
Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machine

Shayan Meshkat Alsadat, Jean-Raphael Gaglione, Daniel Neider, Ufuk Topcu, Zhe Xu

TL;DR

The paper tackles reinforcement learning with non-Markovian rewards by automatically injecting high-level domain knowledge from large language models into a reward-machine interface. It encodes this knowledge as deterministic finite automata (DFA) and reward machines (RM), and uses SAT-based automata learning to derive the minimal RM compatible with LLM-derived DFAs from data traces. A closed-loop prompt refinement mechanism uses counterexamples to iteratively improve the DFA, ensuring convergence to the ground-truth RM and an optimal policy with formal guarantees; empirical case studies show up to a 30% speedup in convergence. This framework reduces reliance on expert-crafted automata and enables domain-specific RL acceleration through automated, adaptive prompting and automata synthesis, while mitigating LLM hallucinations via compatibility checks.
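To make the idea of a non-Markovian reward concrete, the following is a minimal sketch of a reward machine, not the paper's implementation. It assumes each step emits a single label (real reward machines typically read sets of propositions), and the class name `RewardMachine`, the states `u0`, `u1`, `u2`, and the "reach `J`, then `b`" task encoding are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Hypothetical minimal reward machine over single-symbol labels."""
    initial: str
    # Maps (state, label) -> (next_state, reward).
    delta: dict = field(default_factory=dict)

    def run(self, labels):
        """Return the reward emitted at each step of a label sequence."""
        state, rewards = self.initial, []
        for label in labels:
            # Assumption: unlisted (state, label) pairs self-loop with zero reward.
            state, reward = self.delta.get((state, label), (state, 0.0))
            rewards.append(reward)
        return rewards


# Illustrative task: reward 1 only when b is reached after J has been visited.
rm = RewardMachine(
    initial="u0",
    delta={("u0", "J"): ("u1", 0.0), ("u1", "b"): ("u2", 1.0)},
)
print(rm.run(["b", "J", "b"]))  # [0.0, 0.0, 1.0]
```

The same label `b` yields reward 0 on the first step and reward 1 on the third, because the reward depends on the history tracked by the machine state rather than on the current label alone.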

Abstract

We present LARL-RM (Large language model-generated Automaton for Reinforcement Learning with Reward Machine), an algorithm that encodes high-level knowledge as automata to expedite reinforcement learning. Rather than requiring an expert to encode the automaton and supply it to the reinforcement learning algorithm directly, our method obtains high-level domain-specific knowledge from large language models (LLMs) through prompt engineering. We use chain-of-thought and few-shot prompting and demonstrate that the method works with both. LARL-RM also allows fully closed-loop reinforcement learning without an expert to guide and supervise the learning, since it can query the LLM directly for the high-level knowledge required by the task at hand. We prove that the algorithm is guaranteed to converge to an optimal policy, and in two case studies we show that LARL-RM speeds up convergence by 30%.

Paper Structure (18 sections, 13 theorems, 8 equations, 19 figures, 2 algorithms)

Key Result

Lemma 1

Let $\mathcal{M}$ be a labeled MDP, $\mathcal{A}$ the ground truth reward machine encoding the rewards of $\mathcal{M}$, and $\mathscr{D}^\star = \{\mathcal{D}_{1}, \ldots, \mathcal{D}_{m} \}$ the set of all LLM-generated DFAs that are added to $\mathcal{D}$ during the run of LARL-RM. Additionally,

Figures (19)

  • Figure 1: An autonomous car (agent) must first go to intersection $J$ and then make a right turn to go to intersection $b$. Agent must check for the green traffic light $g$, car $c$, and pedestrian $p$.
  • Figure 2: DFA of the motivating example where the agent must reach intersection $J$ while avoiding cars and pedestrians.
  • Figure 3: An LLM-generated DFA for our motivating example.
  • Figure 4: Prompting GPT-3.5-Turbo to adopt a persona as an expert in traffic rules (GPT response and prompt).
  • Figure 5: Mapping the output of the LLM to a specific set of propositions.
  • ...and 14 more figures
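As a rough illustration of the kind of automaton described in the Figure 2 caption, the sketch below encodes "reach intersection `J` while avoiding cars `c` and pedestrians `p`" as a small DFA. The transition structure, the state names `q0`, `q_goal`, `q_fail`, and the `DFA` class are assumptions for illustration only; they are not taken from the paper's figure, and each step is simplified to a single symbol.

```python
class DFA:
    """Hypothetical minimal DFA over single-symbol labels."""

    def __init__(self, initial, accepting, delta):
        self.initial = initial
        self.accepting = set(accepting)
        self.delta = dict(delta)  # maps (state, symbol) -> next_state

    def accepts(self, word):
        """Return True if reading `word` ends in an accepting state."""
        state = self.initial
        for symbol in word:
            # Assumption: unlisted (state, symbol) pairs are self-loops,
            # which makes q_goal and q_fail absorbing states.
            state = self.delta.get((state, symbol), state)
        return state in self.accepting


# Assumed structure: J leads to the goal; c or p before J leads to failure.
dfa = DFA(
    initial="q0",
    accepting={"q_goal"},
    delta={("q0", "J"): "q_goal", ("q0", "c"): "q_fail", ("q0", "p"): "q_fail"},
)
print(dfa.accepts(["g", "J"]))  # True
print(dfa.accepts(["c", "J"]))  # False
```

A trace that meets a car before reaching `J` is rejected even though it later reaches `J`, which is the history-dependent behavior a DFA can express and a per-state reward cannot.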

Theorems & Definitions (28)

  • Definition 1: Labeled Markov decision process
  • Definition 2: Deterministic finite automaton
  • Definition 3: Reward machine
  • Definition 4: LLM-generated DFA compatibility
  • Lemma 1
  • Theorem 1
  • proof: Proof of Lemma \ref{lem:learn-correct-RM}
  • Theorem 2
  • Lemma 2
  • proof
  • ...and 18 more