SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge

Rishi Hazra; Pedro Zuidberg Dos Martires; Luc De Raedt

TL;DR

The paper addresses long-horizon planning with large language models (LLMs) by reframing LM planning within a heuristic search framework and introducing SayCanPay, which jointly scores actions by Say (LM likelihood), Can (feasibility grounding), and Pay (long-term payoff). It trains domain-specific Can and Pay models from expert trajectories and performs offline Beam-Action search to produce action sequences that are both feasible and cost-effective. Across Ravens, BabyAI, and VirtualHome, SayCanPay with Beam-Action achieves higher planning success and cost-effectiveness than prior LLM planning approaches, with some environments showing improved generalization. By integrating learnable domain knowledge with heuristic search, the work advances planning with LLMs and demonstrates practical improvements over purely unguided LM planning.

Abstract

Large Language Models (LLMs) have demonstrated impressive planning abilities due to their vast "world knowledge". Yet, despite recent progress, obtaining plans that are both feasible (grounded in affordances) and cost-effective (in plan length) remains a challenge. This contrasts with heuristic planning methods that employ domain knowledge (formalized in action models such as PDDL) and heuristic search to generate feasible, optimal plans. Inspired by this, we propose to combine the power of LLMs and heuristic planning by leveraging the world knowledge of LLMs and the principles of heuristic search. Our approach, SayCanPay, employs LLMs to generate actions (Say), guided by learnable domain knowledge that evaluates the actions' feasibility (Can) and long-term reward/payoff (Pay), and uses heuristic search to select the best sequence of actions. Our contributions are (1) a novel framing of the LLM planning problem in the context of heuristic planning, (2) integrating grounding and cost-effective elements into the generated plans, and (3) using heuristic search over actions. Our extensive evaluations show that our model surpasses other LLM planning approaches.
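As a minimal illustration of how the Say, Can, and Pay signals might be combined, the sketch below multiplies the three scores for each candidate action and selects the highest-scoring one. The multiplicative combination, the `Candidate` type, the "open red door" action, and all numeric values are assumptions made for illustration; they are not results or definitions taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate next action with hypothetical scores in [0, 1]."""
    action: str
    say: float  # LM likelihood of the action (Say)
    can: float  # estimated feasibility in the current state (Can)
    pay: float  # estimated long-term payoff (Pay)


def combined_score(c: Candidate, use_can: bool = True, use_pay: bool = True) -> float:
    """Assumed combination rule: multiply the enabled scores together."""
    score = c.say
    if use_can:
        score *= c.can
    if use_pay:
        score *= c.pay
    return score


def best_action(cands: Sequence[Candidate], use_can: bool = True, use_pay: bool = True) -> str:
    """Return the action with the highest combined score."""
    return max(cands, key=lambda c: combined_score(c, use_can, use_pay)).action


# Illustrative values only: two feasible actions with different payoffs,
# plus one action the LM likes but that is not feasible yet.
CANDIDATES = [
    Candidate("pick up red key", say=0.35, can=0.90, pay=0.20),
    Candidate("pick up green ball", say=0.30, can=0.90, pay=0.80),
    Candidate("open red door", say=0.40, can=0.05, pay=0.50),
]

print(best_action(CANDIDATES, use_can=False, use_pay=False))  # Say only
print(best_action(CANDIDATES, use_can=True, use_pay=False))   # Say + Can
print(best_action(CANDIDATES))                                # Say + Can + Pay
```

With these made-up numbers, Say alone prefers the infeasible action, adding Can selects a feasible action with a low payoff, and adding Pay selects the feasible action with the higher payoff.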

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, and 7 tables.

Introduction
Related Work on Planning with LLMs
Preliminaries
Planning Framework.
Heuristic Search Planning.
Language Model Planning Framework
SayCanPay Inference
Greedy-Action
Beam-Action
Learning the Can and Pay Models
Can Model
Pay Model
Experimental Setup
Say Model
Environments
...and 9 more sections

Figures (4)

Figure 1: How SayCanPay scores the next action in the BabyAI environment (Chevalier-Boisvert et al., 2019). Given a goal $g$ and an initial observation $o_0$, the Say model generates candidate actions with associated probabilities. These are then scored for feasibility by the Can model and for payoff by the Pay model. Here, the Can model deems both "pick up red key" and "pick up green ball" equally probable (i.e., the preconditions of both are satisfied). However, the Pay model assigns a better payoff to "pick up green ball". We compare plans generated by Say, SayCan, and SayCanPay scoring. Say scoring can lead to infeasible plans, and SayCan to feasible but longer plans. The displayed grid is purely illustrative; no visual inputs are used.
Figure 2: Decoding strategies: Greedy-Token, Greedy-Action, and Beam-Action. Greedy-Token greedily selects the next best token by its probability. Greedy-Action (which is a beam search over tokens) greedily selects the next best action based on a specific decoding score. Beam-Action uses a beam search over actions, maintaining $k$ beams and selecting the best sequence as the plan. Nodes represent either tokens $w_t$ or actions $a_t$. The best plan is given by $(a_1^\ast, a_2^\ast, a_3^\ast)$ and is shown in red. The second-best node is shown in orange, and discarded nodes in black. For Beam-Action, $m=3$ and $k=2$.
Figure 3: [Best viewed in color] From left to right: planning success, cost-effectiveness, and generalization for different beam sizes. Note that generalization on the test-generalize split for VirtualHome is reported as a percentage.
Figure 4: [Best viewed in color] The error plot represents the variance in relative length across the Vicuna and Flan-T5 models. Due to the open-ended nature of VirtualHome, the crowdsourced trajectories are not optimal, which explains why certain cases have a relative length $>1.0$. Note that Greedy-Token decoding in VirtualHome has a relative length $=0$ because no generated plan was executed successfully for either Vicuna or Flan-T5.
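The Beam-Action strategy described in the Figure 2 caption can be sketched as a beam search whose nodes are whole actions rather than tokens. The following is a minimal sketch, not the authors' implementation: it assumes that $m$ is the number of candidate actions expanded per beam, that a sequence score is the product of per-action scores, and it uses a hypothetical toy domain (`toy_propose`, `toy_done`) in place of real Say, Can, and Pay models.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[str, ...]


def beam_action(
    propose: Callable[[Plan], Sequence[Tuple[str, float]]],
    is_done: Callable[[Plan], bool],
    k: int = 2,
    m: int = 3,
    max_steps: int = 10,
) -> Plan:
    """Beam search over actions: keep the k best partial plans at each step."""
    beams: List[Tuple[Plan, float]] = [((), 1.0)]
    for _ in range(max_steps):
        expanded: List[Tuple[Plan, float]] = []
        for plan, score in beams:
            if is_done(plan):
                expanded.append((plan, score))  # finished plans are carried over
                continue
            # Expand only the m highest-scoring candidate actions.
            top_m = sorted(propose(plan), key=lambda x: x[1], reverse=True)[:m]
            for action, s in top_m:
                expanded.append((plan + (action,), score * s))
        if not expanded:
            break
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
        if all(is_done(p) for p, _ in beams):
            break
    return beams[0][0]


def toy_propose(plan: Plan) -> Sequence[Tuple[str, float]]:
    """Hypothetical scored candidates; a real system would query learned models."""
    table = {
        (): [("walk to table", 0.6), ("walk to door", 0.4)],
        ("walk to table",): [("pick up key", 0.5), ("pick up ball", 0.4)],
        ("walk to door",): [("open door", 0.9)],
        ("walk to table", "pick up key"): [("open door", 0.5)],
    }
    return table.get(plan, [])


def toy_done(plan: Plan) -> bool:
    """The toy goal is reached once the last action opens the door."""
    return len(plan) > 0 and plan[-1] == "open door"


print(beam_action(toy_propose, toy_done, k=1))  # greedy over actions
print(beam_action(toy_propose, toy_done, k=2))  # beam of size 2
```

In this toy domain, a beam of size 1 commits to the locally best first action and ends with a three-step plan, while a beam of size 2 keeps the alternative alive and returns a two-step plan with a higher overall score.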
