Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

Keshav Ramani; Vali Tawosi; Salwa Alamir; Daniel Borrajo

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

Keshav Ramani, Vali Tawosi, Salwa Alamir, Daniel Borrajo

TL;DR

The paper addresses the problem of giving formal guarantees to natural language plan outputs by translating NL plans into formal representations, namely Kripke structures $K=(S,S_0,R,L)$ and Linear Temporal Logic (LTL) specifications, and then applying model checking with a tool like NuSMV. The authors propose a two-component framework (Understanding and Reasoning) that uses LLMs to perform the NL-to-formal translation and a deterministic verifier to assess plan validity, adapting PlanBench for evaluation and employing $F(goal)$ to express eventual goal satisfaction. An empirical comparison between GPT-4o and GPT-5 on the PlanBench task shows GPT-5 achieves near-perfect formal representations with high accuracy (e.g., an adjudicated $F1$ around 96%), highlighting the potential of LLM-driven formal verification for plan reliability while acknowledging remaining semantic gaps and unknown outputs. The work demonstrates a significant step toward bridging LLM-based planning with formal methods, with implications for safer, verifiable planning in AI systems and potential extensions to software and circuit verification.

Abstract

We introduce a novel framework for evaluating the alignment between natural language plans and their expected behavior by converting them into Kripke structures and Linear Temporal Logic (LTL) using Large Language Models (LLMs) and performing model checking. We systematically evaluate this framework on a simplified version of the PlanBench plan verification dataset and report on metrics like Accuracy, Precision, Recall and F1 scores. Our experiments demonstrate that GPT-5 achieves excellent classification performance (F1 score of 96.3%) while almost always producing syntactically perfect formal representations that can act as guarantees. However, the synthesis of semantically perfect formal models remains an area for future exploration.

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

TL;DR

The paper addresses the problem of giving formal guarantees to natural language plan outputs by translating NL plans into formal representations, namely Kripke structures

and Linear Temporal Logic (LTL) specifications, and then applying model checking with a tool like NuSMV. The authors propose a two-component framework (Understanding and Reasoning) that uses LLMs to perform the NL-to-formal translation and a deterministic verifier to assess plan validity, adapting PlanBench for evaluation and employing

to express eventual goal satisfaction. An empirical comparison between GPT-4o and GPT-5 on the PlanBench task shows GPT-5 achieves near-perfect formal representations with high accuracy (e.g., an adjudicated

around 96%), highlighting the potential of LLM-driven formal verification for plan reliability while acknowledging remaining semantic gaps and unknown outputs. The work demonstrates a significant step toward bridging LLM-based planning with formal methods, with implications for safer, verifiable planning in AI systems and potential extensions to software and circuit verification.

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

TL;DR

Abstract

Bridging LLM Planning Agents and Formal Methods: A Case Study in Plan Verification

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (1)