Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning

Hong Lu, Pierrick Lorang, Timothy R. Duggan, Jivko Sinapov, Matthias Scheutz

Abstract

In dynamic open-world environments, autonomous agents often encounter novelties that hinder their ability to find plans to achieve their goals. Specifically, traditional symbolic planners fail to generate plans when the robot's planning domain lacks the operators that enable it to interact appropriately with novel objects in the environment. We propose a neuro-symbolic architecture that integrates symbolic planning, reinforcement learning, and a large language model (LLM) to learn how to handle novel objects. In particular, we leverage the common sense reasoning capability of the LLM to identify missing operators, generate plans with the symbolic AI planner, and write reward functions to guide the reinforcement learning agent in learning control policies for newly identified operators. Our method outperforms the state-of-the-art methods in operator discovery as well as operator learning in continuous robotic domains.

Paper Structure (16 sections, 6 figures, 2 tables, 3 algorithms)

Figures (6)

  • Figure 1: The plan-learn-execute loop. The hybrid LLM-symbolic planner parses the domain PDDL and problem PDDL files, prompts the LLM to structurally define the missing operator(s) in PDDL, and finds a plan with grounded operators. It then lifts the grounded operators and outputs the plan together with the modified domain PDDL file, which contains the added lifted operator definitions. The LLM's newly defined operators are in blue, while the existing operators are in black. The learn-execute loop starts by executing the operators in the plan. When it encounters a newly defined operator for which an executor policy does not yet exist, it prompts the LLM to generate dense reward shaping function candidates and launches RL agents to learn a policy for the operator. The effects of the operator are treated as sub-goals for the RL agents. One agent is launched per dense reward function candidate, and the worst-performing agent is eliminated periodically based on sub-goal success rates. The sub-goals are trained in phases. Once training is complete, the best-performing policy is saved as an executor object. The pseudocode for the algorithm is shown in Algorithm \ref{alg:plan_learn_execute}.
  • Figure 2: Problem domains and hybrid LLM-symbolic planner outputs. Domains are ordered by the difficulty of discovering a plannable state via random exploration. Green arrows indicate injected novelties: a lid (Kitchen), a round peg (Nut Assembly), and a drawer or box (Coffee). In the planning graph, green nodes represent states reached via LLM-suggested operators, blue nodes denote existing operators, and orange nodes indicate where the search-ahead algorithm finds a valid plan. Identified missing operators (shown in blue text) include: pick-up-lid-from-pot (Kitchen), pick-up-nut-from-peg (Nut Assembly), pick-up-from-open-box (Coffee-Box), and open-drawer and pick-up-from-drawer (Coffee-Drawer).
  • Figure 3: The Prompt-LLM-for-New-Operator Pipeline. We sample five operator candidates and select the majority to improve accuracy [wang_self-consistency_2023]. Using dynamic prompts containing the current state, goal, and existing operators, the LLM suggests names and parameters for missing operators. Preconditions are automatically filled with grounded predicates involving these parameters, while the LLM defines and orders the effects. For example, if the preconditions for open-drawer include not open drawer1 and not grasped drawer1, the LLM generates grasped drawer1 and open drawer1 as ordered effects, which subsequently serve as sub-goals for the guided learning stage. We observe that errors in effect ordering and operator generation are greatly reduced through self-consistency [wang_self-consistency_2023]. Finally, the grounded operator is lifted into a general PDDL definition by mapping specific entities to their variable types (e.g., drawer1 to ?d - drawer), enabling domain-wide generalization.
  • Figure 4: The LLM Guided Sub-goal Learning Pipeline. Three reward shaping function candidates are sampled from the LLM. The prompt contains the template for a reward shaping function candidate. Information such as the definition of the grounded operator and the observation space of the robot is dynamically filled into the template. The LLM writes a function that computes a velocity-based progress measure from the relevant observations in the robot's numeric observation space. In this example, the function computes the progress for open drawer1 using the distance between the drawer handle and the cabinet. During each sub-goal's training phase, the reward shaping for that sub-goal is unlocked. The worst-performing candidate is periodically eliminated.
  • Figure 5: Comparison of the LLM Guided (LG) Sub-goal Learning with the LEAGUE-Sparse (LS) [cheng_league_2023] and Reward Machine (RM) [icarte2022reward] baselines. We record two metrics: the success rate of the operator and progress toward completion. Progress is the percentage of sub-goals achieved. Metrics are averaged across ten seeds and reported with the standard error of the mean (SEM).
  • ...and 1 more figure
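
The majority vote and the lifting step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names majority_operator and lift, the tuple encoding of operators, and the hard-coded samples (which stand in for five LLM responses) are all assumptions made for this example.

```python
from collections import Counter


def majority_operator(candidates):
    """Return the most frequent candidate among the sampled operators."""
    # Each candidate is a hashable tuple: (name, parameters, ordered_effects).
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner


def lift(grounded_atoms, entity_types):
    """Map grounded entities to typed variables, e.g. drawer1 -> ?d (type drawer)."""
    variables = {}   # entity name -> variable name
    parameters = {}  # variable name -> type name
    lifted = []
    for predicate, *args in grounded_atoms:
        lifted_args = []
        for entity in args:
            if entity not in variables:
                type_name = entity_types[entity]
                base = "?" + type_name[0]
                var, suffix = base, 2
                while var in parameters:  # avoid clashes, e.g. two drawers
                    var = f"{base}{suffix}"
                    suffix += 1
                variables[entity] = var
                parameters[var] = type_name
            lifted_args.append(variables[entity])
        lifted.append((predicate, *lifted_args))
    return lifted, parameters


# Hypothetical samples standing in for five LLM responses.
samples = [
    ("open-drawer", ("drawer1",), (("grasped", "drawer1"), ("open", "drawer1"))),
    ("open-drawer", ("drawer1",), (("grasped", "drawer1"), ("open", "drawer1"))),
    ("open-drawer", ("drawer1",), (("open", "drawer1"), ("grasped", "drawer1"))),
    ("pull-drawer", ("drawer1",), (("open", "drawer1"),)),
    ("open-drawer", ("drawer1",), (("grasped", "drawer1"), ("open", "drawer1"))),
]
name, params, effects = majority_operator(samples)
lifted_effects, parameters = lift(effects, {"drawer1": "drawer"})
print(name, parameters, lifted_effects)
```

Because the effect order is part of each candidate tuple, a sample with the right operator name but the wrong effect order counts as a different candidate, so the vote filters ordering errors as well as naming errors.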
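
The periodic elimination of the worst-performing reward candidate (Figures 1 and 4) might look roughly like the sketch below. It is a simplification under stated assumptions: no RL training is run, the per-round sub-goal success rates are made-up numbers, and the names run_elimination and eliminate_every are hypothetical rather than taken from the paper.

```python
def run_elimination(evaluations, eliminate_every=2):
    """Return the surviving candidate with the highest final success rate.

    evaluations: list of dicts, one per evaluation round, mapping a
    candidate id to its sub-goal success rate in that round.
    """
    active = set(evaluations[0])
    for step, rates in enumerate(evaluations, start=1):
        # Every `eliminate_every` rounds, drop the weakest remaining
        # candidate, always keeping at least one agent alive.
        if step % eliminate_every == 0 and len(active) > 1:
            worst = min(active, key=lambda c: rates[c])
            active.remove(worst)
    final = evaluations[-1]
    return max(active, key=lambda c: final[c])


# Hypothetical success rates for three reward shaping candidates.
evaluations = [
    {"r0": 0.10, "r1": 0.20, "r2": 0.00},
    {"r0": 0.30, "r1": 0.40, "r2": 0.10},  # round 2: r2 is eliminated
    {"r0": 0.50, "r1": 0.45, "r2": 0.20},
    {"r0": 0.70, "r1": 0.60, "r2": 0.90},  # round 4: r1 is eliminated
]
best = run_elimination(evaluations, eliminate_every=2)
print(best)
```

In this sketch the rates for an eliminated candidate are still listed in later rounds only to keep the data rectangular; an eliminated candidate is never reconsidered, which mirrors the idea of freeing compute for the remaining agents.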