Table of Contents
Fetching ...

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

Keshav Ramani, Vali Tawosi, Salwa Alamir, Daniel Borrajo

TL;DR

The paper addresses the problem of giving formal guarantees to natural language plan outputs by translating NL plans into formal representations, namely Kripke structures $K=(S,S_0,R,L)$ and Linear Temporal Logic (LTL) specifications, and then applying model checking with a tool like NuSMV. The authors propose a two-component framework (Understanding and Reasoning) that uses LLMs to perform the NL-to-formal translation and a deterministic verifier to assess plan validity, adapting PlanBench for evaluation and employing $F(goal)$ to express eventual goal satisfaction. An empirical comparison between GPT-4o and GPT-5 on the PlanBench task shows GPT-5 achieves near-perfect formal representations with high accuracy (e.g., an adjudicated $F1$ around 96%), highlighting the potential of LLM-driven formal verification for plan reliability while acknowledging remaining semantic gaps and unknown outputs. The work demonstrates a significant step toward bridging LLM-based planning with formal methods, with implications for safer, verifiable planning in AI systems and potential extensions to software and circuit verification.

Abstract

We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

TL;DR

The paper addresses the problem of giving formal guarantees to natural language plan outputs by translating NL plans into formal representations, namely Kripke structures and Linear Temporal Logic (LTL) specifications, and then applying model checking with a tool like NuSMV. The authors propose a two-component framework (Understanding and Reasoning) that uses LLMs to perform the NL-to-formal translation and a deterministic verifier to assess plan validity, adapting PlanBench for evaluation and employing to express eventual goal satisfaction. An empirical comparison between GPT-4o and GPT-5 on the PlanBench task shows GPT-5 achieves near-perfect formal representations with high accuracy (e.g., an adjudicated around 96%), highlighting the potential of LLM-driven formal verification for plan reliability while acknowledging remaining semantic gaps and unknown outputs. The work demonstrates a significant step toward bridging LLM-based planning with formal methods, with implications for safer, verifiable planning in AI systems and potential extensions to software and circuit verification.

Abstract

We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.

Paper Structure

This paper contains 8 sections, 2 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overview of the LLM-driven plan-specification alignment framework. The Planbench plan verification task contains natural language descriptions of goals, intents, and the environment along with the plan. The LLM addresses the task of understanding such natural language inputs and converting them to a Kripke structure (represented in the NuSMV format) and an LTL specification. For reasoning, we parse this output and provide it to the NuSMV model checker.