OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following

Haochen Shi, Zhiyuan Sun, Xingdi Yuan, Marc-Alexandre Côté, Bang Liu

TL;DR

This paper introduces OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. It finds that LLM-centric design markedly improves Embodied Instruction Following (EIF) outcomes and identifies visual perception and low-level action execution as critical bottlenecks.

Abstract

Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, there is no unified understanding of how the various components, ranging from visual perception to action execution, affect task performance. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.

Paper Structure (34 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of our OPEx framework. We will open-source the code after acceptance.
  • Figure 2: Example of a Clean & Place task in ALFRED.
  • Figure 3: Prompt example of the LLM-based Planner. Setup is fixed across all input test cases; Task is the input to the LLM-based Planner and varies by test case; Task type, Thought, and Plan are the content the LLM-based Planner must generate. The same color coding applies to the other figures.
  • Figure 4: Prompt example of the LLM-based Observer.
  • Figure 5: Prompt example of the LLM-based Executor.