LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents

Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang

TL;DR

LoTa-Bench introduces a quantitative benchmark for evaluating language-oriented task planners in embodied home-service agents, pairing ALFRED/AI2-THOR and WAH-NL/VirtualHome to enable automatic, reproducible assessment. It analyzes baseline LLM planners across model families, prompt designs, and context lengths, then systematically validates extensions like in-context example selection, NL feedback-driven replanning, and domain-specific fine-tuning. Key findings show that larger models can help but are not universally superior, semantic-similarity-based in-context selection yields meaningful gains, replanning benefits emerge with large models, and in-domain fine-tuning dramatically boosts ALFRED performance but does not transfer well to WAH-NL. The work delivers public code and extended datasets, and outlines limitations such as decoupled planning and low-level grounding, charting a path toward more comprehensive, end-to-end benchmarking for embodied language-oriented planning.

Abstract

Large language models (LLMs) have recently received considerable attention as alternative solutions for task planning. However, comparing the performance of language-oriented task planners becomes difficult, and there exists a dearth of detailed exploration regarding the effects of various factors such as pre-trained model selection and prompt construction. To address this, we propose a benchmark system for automatically quantifying performance of task planning for home-service embodied agents. Task planners are tested on two pairs of datasets and simulators: 1) ALFRED and AI2-THOR, 2) an extension of Watch-And-Help and VirtualHome. Using the proposed benchmark system, we perform extensive experiments with LLMs and prompts, and explore several enhancements of the baseline planner. We expect that the proposed benchmark tool would accelerate the development of language-oriented task planners.

Paper Structure (22 sections, 2 equations, 12 figures, 8 tables)

This paper contains 22 sections, 2 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overall benchmarking configuration for LLM-based task planners. NL stands for Natural Language. We used two setups: 1) ALFRED dataset with AI2-THOR simulator and 2) WAH-NL dataset with VirtualHome simulator. Exemplary prompt and skill set are presented on the left side.
  • Figure 2: Baseline results on (a) ALFRED and (b) WAH-NL. We report task success rates (%) on the ALFRED dataset and average subgoal success rates (%) on the WAH-NL dataset for language models in different model classes and sizes (number of parameters). Base language models are shown as solid lines. Fine-tuned models (on either instruction or chat data) are shown as dashed lines with triangle markers.
  • Figure 3: (Subgoal) success rates for different numbers of in-context examples.
  • Figure 4: Planning results. Success cases of (a) the ALFRED task and (b) the WAH-NL task using the GPT-3 175B model. The input instructions, inferred steps, and scene images after each step execution are presented in each figure. The scene images show the agent's point of view on ALFRED and a third-person point of view on WAH-NL. Additional results, including failure cases, are provided in the appendix.
  • Figure 5: Subgoal success rate for different in-context example selection strategies. The dashed line represents the best performance of our baseline planner using GPT-3 175B.
  • ...and 7 more figures
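
The captions above refer to two metrics: task success rate on ALFRED and average subgoal success rate on WAH-NL. The following is a minimal sketch of how such percentages could be computed, assuming a task counts as successful only when its goal is fully achieved and that each task's subgoal rate is the fraction of its subgoals satisfied. The function names and input shapes are hypothetical illustrations and are not taken from the LoTa-Bench code.

```python
from typing import Sequence, Tuple


def task_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks whose goal was fully achieved (assumed definition)."""
    if not outcomes:
        raise ValueError("at least one task outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_subgoal_success_rate(per_task: Sequence[Tuple[int, int]]) -> float:
    """Mean over tasks of (satisfied subgoals / total subgoals), as a percentage.

    Each entry is a hypothetical (satisfied, total) pair for one task.
    """
    if not per_task:
        raise ValueError("at least one task is required")
    rates = []
    for satisfied, total in per_task:
        if total <= 0 or not 0 <= satisfied <= total:
            raise ValueError("invalid subgoal counts")
        rates.append(satisfied / total)
    return 100.0 * sum(rates) / len(rates)


if __name__ == "__main__":
    # Three of four tasks succeed.
    print(task_success_rate([True, False, True, True]))
    # One task satisfies 1 of 2 subgoals, another satisfies 2 of 2.
    print(average_subgoal_success_rate([(1, 2), (2, 2)]))
```

Averaging per-task fractions, rather than pooling all subgoals, weights every task equally regardless of how many subgoals it has; this is a design assumption of the sketch.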