Plan Verification for LLM-Based Embodied Task Completion Agents

Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur

TL;DR

The paper tackles noisy LLM-generated embodied task plans, and the human demonstrations that accompany them, by introducing a language-based Judge–Planner verification loop that iteratively critiques and revises action sequences. The method uses zero-shot prompts, is model-agnostic, and converges quickly: 96.5% of sequences require at most three iterations. Across four Judge LLMs, the approach achieves up to $\text{Recall} = 0.90$ and $\text{Precision} = 1.00$, yielding a substantial improvement in plan quality. The framework is formalized with a verification operator $V = P\circ J$ (the Planner $P$ applied to the critique of the Judge $J$) and a convergence property $\mathbb{E}[E^{(k+1)}] \le (1-\delta) \mathbb{E}[E^{(k)}]$. It preserves human error-recovery patterns, enabling scalable augmentation of imitation-learning data. By providing interpretable, line-by-line rationales and a robust verification pathway, the approach offers a practical route to higher-quality demonstrations and more reliable downstream learning for embodied AI.
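
To see why the convergence property implies that few iterations are needed, the bound can be unrolled. This is a sketch under assumptions the summary does not state explicitly: $E^{(k)}$ is taken to be the number of errors remaining after $k$ Judge–Planner rounds, and $\delta$ is a fixed constant with $0 < \delta < 1$.

$$
\mathbb{E}[E^{(k)}] \le (1-\delta)\,\mathbb{E}[E^{(k-1)}] \le \cdots \le (1-\delta)^{k}\,\mathbb{E}[E^{(0)}]
$$

Under these assumptions the expected error decays geometrically, so it falls below a fraction $\varepsilon \in (0,1)$ of its initial value once

$$
(1-\delta)^{k} \le \varepsilon \quad\Longleftrightarrow\quad k \ge \frac{\ln \varepsilon}{\ln(1-\delta)}.
$$

As a purely illustrative case (the value is hypothetical, not taken from the paper), $\delta = 0.5$ gives $(1-\delta)^{3} = 0.125$, meaning three rounds would cut the expected error to at most one eighth of its starting level.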

Abstract

Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
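
The Judge–Planner loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `verify_plan`, `toy_judge`, and `toy_planner`, the `(index, verdict, rationale)` critique format, and the `REMOVE` verdict are all hypothetical. The toy judge and planner are rule-based stand-ins for the LLM calls, used only so the example runs without a model.

```python
from typing import Callable, List, Tuple

Plan = List[str]
# Hypothetical critique format: (line index, verdict, rationale).
Critique = List[Tuple[int, str, str]]


def verify_plan(
    plan: Plan,
    judge: Callable[[Plan], Critique],
    planner: Callable[[Plan, Critique], Plan],
    max_iters: int = 3,
) -> Tuple[Plan, int]:
    """Repeat one Judge pass followed by one Planner pass until the
    Judge flags nothing or the iteration budget is used up.

    Returns the final plan and the number of revision rounds applied.
    If the budget runs out, the returned plan may still contain errors.
    """
    current = list(plan)
    for rounds_done in range(max_iters):
        critique = judge(current)
        if not critique:  # Judge has no objections: treat as converged
            return current, rounds_done
        current = planner(current, critique)
    return current, max_iters


def toy_judge(plan: Plan) -> Critique:
    """Stand-in for the Judge LLM: flag an action that repeats the previous one."""
    return [
        (i, "REMOVE", "repeats the previous action")
        for i in range(1, len(plan))
        if plan[i] == plan[i - 1]
    ]


def toy_planner(plan: Plan, critique: Critique) -> Plan:
    """Stand-in for the Planner LLM: drop every line the Judge marked REMOVE."""
    drop = {i for i, verdict, _ in critique if verdict == "REMOVE"}
    return [action for i, action in enumerate(plan) if i not in drop]


if __name__ == "__main__":
    noisy = [
        "goto fridge", "goto fridge", "open fridge",
        "pickup apple", "pickup apple", "pickup apple",
    ]
    print(verify_plan(noisy, toy_judge, toy_planner))
    # (['goto fridge', 'open fridge', 'pickup apple'], 1)
```

Passing the judge and planner as callables keeps the loop model-agnostic in the sense the abstract describes: swapping in a different LLM only changes the two functions, not the control flow. The default budget of three rounds mirrors the reported observation that 96.5% of sequences need at most three iterations.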

Paper Structure

This paper contains 29 sections, 8 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Diagram of Planning Agent and Judge LLM Interaction Process for Plan Verification
  • Figure 2: Diagram of Sample Workflow in TEACh Dataset
  • Figure 3: Cumulative convergence of action sequences across iterations. Most sequences (62%) are corrected after the first iteration, with near-complete convergence (96.5%) by iteration 3.