Socratic Planner: Self-QA-Based Zero-Shot Planning for Embodied Instruction Following

Suyeon Shin, Sujin Jeon, Junghyun Kim, Gi-Cheon Kang, Byoung-Tak Zhang

TL;DR

This work tackles Embodied Instruction Following (EIF) by introducing the Socratic Planner, a zero-shot planning framework that uses self-QA with a Large Language Model to decompose instructions into subgoals (Socratic Task Decomposer) and generate action sequences. It couples this with Vision-based Socratic Re-planning, which grounds reasoning in dense visual feedback from the environment to adjust plans on the fly. Across the ALFRED benchmark, the approach outperforms state-of-the-art zero-shot and few-shot methods, with pronounced gains on long-horizon tasks, and it demonstrates real-world viability on a UR5e robot. By eliminating labeled data requirements and leveraging structured, QA-driven decomposition plus multimodal feedback, the method offers a practical path toward robust, scalable embodied instruction following.

Abstract

Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in interactive environments. A key challenge in EIF is compositional task planning, typically addressed through supervised learning or few-shot in-context learning with labeled data. To this end, we introduce the Socratic Planner, a self-QA-based zero-shot planning method that infers an appropriate plan without any further training. The Socratic Planner first facilitates self-questioning and answering by the Large Language Model (LLM), which in turn helps generate a sequence of subgoals. While executing the subgoals, an embodied agent may encounter unexpected situations, such as unforeseen obstacles. The Socratic Planner then adjusts plans based on dense visual feedback through a visually-grounded re-planning mechanism. Experiments demonstrate the effectiveness of the Socratic Planner, outperforming current state-of-the-art planning models on the ALFRED benchmark across all metrics, particularly excelling in long-horizon tasks that demand complex inference. We further demonstrate its real-world applicability through deployment on a physical robot for long-horizon tasks.
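The plan, execute, and re-plan loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt prefixes (`ASK:`, `ANSWER:`, `PLAN:`, `REPLAN:`), the number of QA rounds, and the re-planning budget are all assumptions made for the example. The LLM, the executor, and the visual describer (standing in for the MLLM feedback) are passed in as plain callables so that no particular model API is implied.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: each component is just a callable supplied by the caller.
LLM = Callable[[str], str]        # prompt -> text completion
Executor = Callable[[str], bool]  # subgoal -> True if it succeeded
Describer = Callable[[str], str]  # failed subgoal -> textual visual feedback


def _parse_lines(text: str) -> List[str]:
    """Split an LLM reply into non-empty, stripped lines (one subgoal per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def decompose(instruction: str, llm: LLM, num_questions: int = 3) -> List[str]:
    """Self-QA decomposition: the LLM questions itself, answers, then lists subgoals."""
    dialogue: List[Tuple[str, str]] = []
    for _ in range(num_questions):  # number of QA rounds is an assumption
        question = llm(f"ASK: instruction={instruction!r} dialogue={dialogue!r}")
        answer = llm(f"ANSWER: instruction={instruction!r} question={question!r}")
        dialogue.append((question, answer))
    plan = llm(f"PLAN: instruction={instruction!r} dialogue={dialogue!r}")
    return _parse_lines(plan)


def run_with_replanning(
    instruction: str,
    llm: LLM,
    execute: Executor,
    describe_scene: Describer,
    max_replans: int = 2,  # assumed budget, not taken from the paper
) -> Tuple[bool, List[str]]:
    """Execute subgoals in order; on failure, revise the remaining plan from visual feedback."""
    subgoals = decompose(instruction, llm)
    done: List[str] = []
    replans = 0
    while subgoals:
        current = subgoals[0]
        if execute(current):
            done.append(subgoals.pop(0))
            continue
        if replans >= max_replans:
            return False, done
        replans += 1
        feedback = describe_scene(current)  # stands in for MLLM feedback
        revised = llm(
            f"REPLAN: instruction={instruction!r} failed={current!r} "
            f"feedback={feedback!r} remaining={subgoals!r}"
        )
        subgoals = _parse_lines(revised)
        if not subgoals:  # the LLM produced no usable plan
            return False, done
    return True, done
```

Injecting the three components as callables keeps the control flow testable with simple stubs; a real system would replace them with model calls and an environment interface.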

Paper Structure

This paper contains 19 sections, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Comparison between existing EIF methods and the Socratic Planner. The Socratic Planner enriches zero-shot task planning through self-questioning and answering to extract substructural information. Based on the QA conversations, a Large Language Model (LLM) generates subgoals. Solid black arrows denote static planning paths, while dashed blue arrows represent re-planning paths taken upon execution failures, where visually grounded feedback from the Multimodal Large Language Model (MLLM) guides the LLM to modify the plan.
  • Figure 2: Input prompt and output of the Socratic Task Decomposer.
  • Figure 3: Comparison with LLM-Planner on zero-shot and few-shot settings in EIF metrics, where valid seen splits are used for evaluation.
  • Figure 4: Experimental setup (left) and the robot's camera view (right).
  • Figure 5: Visualization of the agent action sequence acquired by Socratic Planner without STD (top right) and our Socratic Planner (bottom).
  • ...and 1 more figure