FlowPlan: Zero-Shot Task Planning with LLM Flow Engineering for Robotic Instruction Following

Zijun Lin, Chao Tang, Hanjing Ye, Hong Zhang

TL;DR

FlowPlan addresses the challenge of zero-shot robotic instruction following by introducing a structured four-stage LLM workflow (Task Information Retrieval, Language-Level Reasoning, Symbolic-Level Planning, Logical Evaluation) coupled with context-aligned target localization built on an online semantic map. This modular design enables robust grounding of lengthy instructions under operational constraints without labeled data, achieving strong performance on ALFRED and successful real-world deployments. The key contributions include a formalized multi-stage planning process, a context-aware grounding mechanism, and comprehensive ablations that underscore the importance of each component. The approach offers practical impact by reducing data requirements and adapting across diverse environments, with potential extensions to vision-language fusion and open-vocabulary perception.

Abstract

Robotic instruction following tasks require seamless integration of visual perception, task planning, target localization, and motion execution. However, existing task planning methods for instruction following are either data-driven or underperform in zero-shot scenarios due to difficulties in grounding lengthy instructions into actionable plans under operational constraints. To address this, we propose FlowPlan, a structured multi-stage LLM workflow that strengthens the zero-shot pipeline and narrows the performance gap between zero-shot and data-driven in-context learning methods. By decomposing the planning process into four modular stages (task information retrieval, language-level reasoning, symbolic-level planning, and logical evaluation), FlowPlan generates logically coherent action sequences that adhere to operational constraints, and it further extracts contextual guidance for precise instance-level target localization. Benchmarked on ALFRED and validated in real-world applications, our method achieves competitive performance relative to data-driven in-context learning methods and demonstrates adaptability across diverse environments. This work advances zero-shot task planning in robotic systems without reliance on labeled data. Project website: https://instruction-following-project.github.io/.
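As a rough illustration of how the four stages named in the abstract could be chained, the following Python sketch passes the output of each stage to the next. It is not the authors' implementation: the `LLM` callable interface, the prompt wording, the `PlanState` container, and the bounded retry loop driven by the evaluation stage are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


@dataclass
class PlanState:
    instruction: str
    task_info: str = ""
    reasoning: str = ""
    plan: List[str] = field(default_factory=list)
    feedback: str = ""  # empty when the last evaluation accepted the plan


def run_flow(instruction: str, llm: LLM, max_rounds: int = 2) -> PlanState:
    state = PlanState(instruction)

    # Stage 1: task information retrieval.
    state.task_info = llm(
        f"[retrieve] Extract objects, receptacles, and goals from: {instruction}"
    )

    # The retry loop is an assumption; the source only names the four stages.
    for _ in range(max_rounds):
        # Stage 2: language-level reasoning.
        state.reasoning = llm(
            "[reason] Describe the required steps in plain language.\n"
            f"Info: {state.task_info}\nFeedback: {state.feedback}"
        )

        # Stage 3: symbolic-level planning, one action per line.
        raw = llm(
            "[plan] Convert the steps to one symbolic action per line.\n"
            f"Steps: {state.reasoning}"
        )
        state.plan = [line.strip() for line in raw.splitlines() if line.strip()]

        # Stage 4: logical evaluation against operational constraints.
        verdict = llm(
            "[evaluate] Reply OK or describe the violated constraint.\n"
            "Plan:\n" + "\n".join(state.plan)
        )
        if verdict.strip().upper().startswith("OK"):
            state.feedback = ""
            break
        state.feedback = verdict

    return state
```

Because `llm` is just a callable, the flow can be exercised with a scripted stub before connecting a real model, which makes it easy to check that each stage receives the previous stage's output.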

Paper Structure

This paper contains 14 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview. In robotic instruction following tasks, a robot performs a series of planning steps and movements within an unfamiliar environment to achieve specific objectives. FlowPlan integrates multi-stage task planning with context-aligned target localization. The former involves task information retrieval, language-level reasoning, symbolic-level planning, and logical evaluation. The latter uses a semantic map of the scene, constructed online, to locate targets for navigation: it predicts object co-location probabilities and refines them with contextual guidance derived from the instructions.
  • Figure 2: Multi-Stage Task Planning. The multi-stage task planning process comprises four interpretable stages (task information retrieval, language-level reasoning, symbolic-level planning, and logical evaluation) that generate logically coherent task plans under operational constraints. All stages are managed by LLM-driven components and do not require labeled data or example sets.
  • Figure 3: Context-Aligned Target Localization. The target localization process consists of two key components: object co-location and context alignment. The former produces a probability distribution at the category level, while the latter utilizes guidance from instructions to align with a specific target instance.
  • Figure 4: Visualization of Action Sequences. Our method generates logically coherent task plans and executes them effectively, whereas Inoue's method, Prompter [inoue2022prompter], fails.
  • Figure 5: Real-world instruction following experiments.
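
The two components described in the Figure 3 caption (a category-level co-location distribution, then alignment to a specific instance using guidance from the instruction) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the multiplicative combination rule, the neutral weight of 0.5 for instances that the guidance does not mention, and all names and numbers. The paper's actual model may differ.

```python
from typing import Dict, Tuple


def localize(
    category_prob: Dict[str, float],      # co-location probability per category
    instance_category: Dict[str, str],    # mapped instance id -> its category
    context_score: Dict[str, float],      # instance id -> match with guidance, in [0, 1]
    neutral: float = 0.5,                 # assumed weight when guidance is silent
) -> Tuple[str, Dict[str, float]]:
    # Combine the category-level prior with the instance-level context match.
    scores = {
        inst: category_prob.get(cat, 0.0) * context_score.get(inst, neutral)
        for inst, cat in instance_category.items()
    }
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("no candidate instance has positive score")

    # Normalize so the instance scores form a probability distribution.
    posterior = {inst: s / total for inst, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Hypothetical scene: two countertops and one table in the semantic map.
category_prob = {"CounterTop": 0.6, "DiningTable": 0.3, "Sofa": 0.1}
instance_category = {
    "countertop_1": "CounterTop",
    "countertop_2": "CounterTop",
    "table_1": "DiningTable",
}
# Guidance such as "the counter next to the sink" favors countertop_2.
context_score = {"countertop_1": 0.2, "countertop_2": 0.9}

best, posterior = localize(category_prob, instance_category, context_score)
```

In this toy scene the category prior alone cannot separate the two countertops, so the context score is what selects a single instance as the navigation target.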