Synthesizing Programmatic Reinforcement Learning Policies with Large Language Model Guided Search

Max Liu, Chan-Hung Yu, Wei-Hsu Lee, Cheng-Wei Hung, Yen-Chun Chen, Shao-Hua Sun

TL;DR

The paper introduces LLM-GS, a framework that harnesses large language models to bootstrap sample-efficient programmatic reinforcement learning. By coupling a Pythonic-DSL translation pathway with a budget-aware Scheduled Hill Climbing search, it seeds the PRL search with LLM-produced programs and progressively explores the program space to maximize episodic return. In Karel and Minigrid experiments, LLM-GS substantially improves sample efficiency over state-of-the-art PRL baselines and demonstrates extensibility to tasks described in natural language. These results suggest a practical pathway to more interpretable, generalizable PRL policies with dramatically fewer environment interactions.

Abstract

Programmatic reinforcement learning (PRL) represents policies as programs in order to achieve interpretability and generalization. Despite promising outcomes, current state-of-the-art PRL methods are hindered by sample inefficiency, requiring tens of millions of program-environment interactions. To tackle this challenge, we introduce a novel LLM-guided search framework (LLM-GS). Our key insight is to leverage the programming expertise and common-sense reasoning of LLMs to enhance the efficiency of assumption-free, random-guessing search methods. We address LLMs' inability to generate precise and grammatically correct programs in domain-specific languages (DSLs) by proposing a Pythonic-DSL strategy: the LLM is instructed to first generate Python code and then convert it into DSL programs. To further optimize the LLM-generated programs, we develop a search algorithm named Scheduled Hill Climbing, designed to explore the programmatic search space efficiently and improve the programs consistently. Experimental results in the Karel domain demonstrate our LLM-GS framework's superior effectiveness and efficiency. Extensive ablation studies further verify the critical role of our Pythonic-DSL strategy and Scheduled Hill Climbing algorithm. Moreover, we conduct experiments with two novel tasks, showing that LLM-GS enables users without programming skills or knowledge of the domain or DSL to describe tasks in natural language and obtain performant programs.
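
To make the Pythonic-DSL idea concrete, the sketch below converts a restricted Python function into a DSL-style token string. It is an illustration under stated assumptions, not the paper's pipeline: the abstract only says the LLM is instructed to write Python and then convert it, so the rule-based `ast` converter, the helper names (`python_to_dsl`, `stmt_to_dsl`, `cond_to_dsl`), the action and perception names, and the bracketed token format (`m( ... m)`, `c( ... c)`, and so on) are all assumptions loosely modeled on Karel-style DSLs.

```python
import ast

# Assumed action and perception names (illustrative, not taken from the source).
ACTIONS = {"move", "turnLeft", "turnRight", "pickMarker", "putMarker"}
PERCEPTIONS = {"frontIsClear", "leftIsClear", "rightIsClear",
               "markersPresent", "noMarkersPresent"}


def _is_call_to(node, names):
    """True if node is a zero-argument call to a bare name in `names`."""
    return (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in names and not node.args and not node.keywords)


def cond_to_dsl(node):
    """Translate a condition: a perception call, optionally negated."""
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return f"not c( {cond_to_dsl(node.operand)} c)"
    if _is_call_to(node, PERCEPTIONS):
        return node.func.id
    raise ValueError(f"unsupported condition: {ast.dump(node)}")


def block_to_dsl(stmts):
    return " ".join(stmt_to_dsl(s) for s in stmts)


def stmt_to_dsl(node):
    """Translate one statement of the restricted Python subset."""
    if isinstance(node, ast.Expr) and _is_call_to(node.value, ACTIONS):
        return node.value.func.id
    if isinstance(node, ast.While) and not node.orelse:
        return (f"WHILE c( {cond_to_dsl(node.test)} c) "
                f"w( {block_to_dsl(node.body)} w)")
    if isinstance(node, ast.If):
        head = f"c( {cond_to_dsl(node.test)} c) i( {block_to_dsl(node.body)} i)"
        if node.orelse:
            return f"IFELSE {head} ELSE e( {block_to_dsl(node.orelse)} e)"
        return f"IF {head}"
    if isinstance(node, ast.For):
        it = node.iter
        # Only `for _ in range(<int literal>)` maps onto a fixed-count repeat.
        if (isinstance(it, ast.Call) and isinstance(it.func, ast.Name)
                and it.func.id == "range" and len(it.args) == 1
                and isinstance(it.args[0], ast.Constant)
                and isinstance(it.args[0].value, int)):
            return f"REPEAT R={it.args[0].value} r( {block_to_dsl(node.body)} r)"
    raise ValueError(f"unsupported statement: {ast.dump(node)}")


def python_to_dsl(source):
    """Convert a single `def run():` function into a DSL token string."""
    tree = ast.parse(source)
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        raise ValueError("expected exactly one function definition")
    func = tree.body[0]
    if func.name != "run":
        raise ValueError("the entry point must be named `run`")
    return f"DEF run m( {block_to_dsl(func.body)} m)"


SOURCE = """
def run():
    while frontIsClear():
        move()
        if markersPresent():
            pickMarker()
    for _ in range(2):
        turnLeft()
"""

print(python_to_dsl(SOURCE))
```

Anything outside the restricted subset (assignments, arbitrary calls, nested functions) raises `ValueError`. That rejection is the point of a grammar-checked conversion step: only well-formed programs are handed on to the search.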

Paper Structure (86 sections, 4 equations, 25 figures, 10 tables, 2 algorithms)

Figures (25)

  • Figure 1: The Karel DSL grammar. It describes the actions, perceptions, and control flows of the Karel domain-specific language. The DSL is taken from HPRL (Liu et al., 2023).
  • Figure 2: An example Karel task -- DoorKey. The agent first needs to find the key (marker) in the left room, which will open the door (wall) to the right room. Navigating to the goal marker in the right room and placing the picked marker on it will grant the full reward for the task. This sparse-reward task has been found to pose significant challenges to previous PRL methods, as it necessitates a greater capability in long-horizon strategy formulation.
  • Figure 3: Large language model-guided search (LLM-GS). (a) Given a task description and the Pythonic-DSL instruction, the LLM generates Python programs that are subsequently converted to DSL programs. (b) These programs serve as the initial population of our proposed Scheduled Hill Climbing, which evaluates the episodic return of neighboring programs to update the current candidate program, increasing the neighborhood size over search steps.
  • Figure 4: Efficiency in the Karel tasks. We compare our proposed LLM-guided search (LLM-GS) framework against existing methods (LEAPS, HPRL, CEBS, and HC) on the Karel and Karel-Hard problem sets. The results show that our LLM-GS is significantly more efficient than these methods.
  • Figure 5: Example on DoorKey. This example shows how our search method improves an LLM-initialized program into an optimal one. The original program (left) has a two-stage structure but lacks navigation ability. The improved program (right) fixes this by strengthening navigation in both stages, which lets it solve the task.
  • ...and 20 more figures
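
The search loop described in the Figure 3(b) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the geometric schedule in `neighborhood_size`, its `k_start` and `k_end` defaults, the choice to start from the single best initial program, and the toy `toy_return` / `toy_mutate` functions standing in for episodic return and program mutation are all hypothetical.

```python
import random


def neighborhood_size(used, budget, k_start=4, k_end=64):
    """Hypothetical schedule: grow the neighbor count geometrically
    from k_start to k_end as the evaluation budget is consumed."""
    frac = min(max(used / budget, 0.0), 1.0)
    return round(k_start * (k_end / k_start) ** frac)


def scheduled_hill_climbing(initial_programs, evaluate, mutate, budget, rng):
    """Hill climbing whose neighborhood grows over the course of the search.

    `evaluate` returns a scalar return for a program; `mutate` returns a
    random neighbor. Each call to `evaluate` costs one unit of budget.
    Assumes budget > len(initial_programs).
    """
    used = 0
    scored = []
    for program in initial_programs:
        scored.append((evaluate(program), program))
        used += 1
    # Simplification: continue from the best initial program only.
    best_return, best = max(scored, key=lambda pair: pair[0])

    while used < budget:
        # Never request more evaluations than the remaining budget allows.
        k = min(neighborhood_size(used, budget), budget - used)
        neighbors = [mutate(best, rng) for _ in range(k)]
        returns = [evaluate(n) for n in neighbors]
        used += k
        top = max(range(k), key=returns.__getitem__)
        # Move only when a neighbor strictly improves the return.
        if returns[top] > best_return:
            best_return, best = returns[top], neighbors[top]
    return best, best_return, used


# Toy stand-in for a program space: tuples of digits scored against a target.
TARGET = (3, 1, 4, 1, 5)


def toy_return(program):
    return -sum(abs(a - b) for a, b in zip(program, TARGET))


def toy_mutate(program, rng):
    i = rng.randrange(len(program))
    mutated = list(program)
    mutated[i] = rng.randint(0, 9)
    return tuple(mutated)


rng = random.Random(0)
seeds = [(0, 0, 0, 0, 0), (9, 9, 9, 9, 9)]
best, best_return, used = scheduled_hill_climbing(
    seeds, toy_return, toy_mutate, budget=500, rng=rng)
print(best, best_return, used)
```

A small neighborhood early on spends few evaluations per step while improvements are easy to find. A larger one later spends more of the budget per step once the candidate is harder to improve.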