Long-horizon Embodied Planning with Implicit Logical Inference and Hallucination Mitigation

Siyuan Liu, Jiawei Du, Sicheng Xiang, Zibo Wang, Dingsheng Luo

TL;DR

This work presents ReLEP, a framework for Real-time Long-horizon Embodied Planning that achieves high success rates and compliance with execution, even on unseen tasks, and outperforms state-of-the-art baseline methods.

Abstract

Long-horizon embodied planning underpins embodied AI. One of the most feasible ways to accomplish long-horizon tasks is to decompose abstract instructions into a sequence of actionable steps. Foundation models still make logical errors and hallucinate in long-horizon planning unless they are provided with examples highly relevant to the task. However, providing highly relevant examples for arbitrary tasks is impractical. Therefore, we present ReLEP, a novel framework for Real-time Long-horizon Embodied Planning. ReLEP can complete a wide range of long-horizon tasks without in-context examples by learning implicit logical inference through fine-tuning. The fine-tuned large vision-language model formulates plans as sequences of skill functions, which are selected from a carefully designed skill library. ReLEP is also equipped with a Memory module for plan and status recall, and a Robot Configuration module for versatility across robot types. In addition, we propose a data generation pipeline to tackle dataset scarcity. The dataset is constructed with implicit logical relationships in mind, enabling the model to learn these relationships and dispel hallucinations. In comprehensive evaluations across various long-horizon tasks, ReLEP demonstrates high success rates and compliance with execution, even on unseen tasks, and outperforms state-of-the-art baseline methods.
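To make "plans as sequences of skill functions" concrete, the following minimal sketch represents a plan as a list of skill calls and flags steps that use a skill absent from the library, which is one form of the hallucination described above. The skill names, their arities, the `skill(arg, ...)` step syntax, and the helper names are all hypothetical illustrations; the paper's actual skill library and output format are not given in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical skill library: skill name -> expected number of arguments.
# These entries are illustrative assumptions, not the paper's library.
SKILL_LIBRARY = {"move_to": 1, "pick_up": 1, "put_down": 2, "open": 1}


@dataclass(frozen=True)
class SkillCall:
    name: str
    args: Tuple[str, ...]


def parse_step(step: str) -> SkillCall:
    """Parse a step written as 'skill(arg1, arg2)' into a SkillCall."""
    name, _, rest = step.strip().partition("(")
    if not rest.endswith(")"):
        raise ValueError(f"malformed step: {step!r}")
    args = tuple(a.strip() for a in rest[:-1].split(",") if a.strip())
    return SkillCall(name.strip(), args)


def find_undefined_skills(plan: List[str]) -> List[str]:
    """Return the steps whose skill is not in the library or has the wrong arity."""
    bad = []
    for step in plan:
        call = parse_step(step)
        # .get() returns None for unknown skills, which never equals an arity.
        if SKILL_LIBRARY.get(call.name) != len(call.args):
            bad.append(step)
    return bad


plan = ["move_to(table)", "pick_up(cup)", "teleport(kitchen)"]
invalid_steps = find_undefined_skills(plan)
```

Here `invalid_steps` contains only the `teleport(kitchen)` step, because `teleport` is not a defined skill. A check of this kind only detects fabricated skills after the fact; ReLEP instead aims to avoid producing them through fine-tuning.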

Paper Structure

This paper contains 25 sections, 2 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: An overview of ReLEP. Given an instruction and a current scene image, a fine-tuned large vision-language model formulates plans as sequences of skill functions according to a skill library, a Memory module, and a Robot Configuration module. Then, the robot executes the first step and saves executed steps and past plans into the Memory module for subsequent rounds of planning.
  • Figure 2: Four types of errors GPT-4V made in long-horizon embodied planning with loosely relevant examples. Logical understanding errors are misinterpretations of individual skills. Missing skills can lead to skill combination errors. Logical mistakes result in logical planning errors. Hallucinations include fabricating undefined skills.
  • Figure A1: Data acquisition pipeline. We collected images from indoor object detection datasets and used GPT-4V to generate possible tasks a robot may perform in these scenes. We then manually refined the generated tasks and asked GPT-4V to predict the corresponding plans. Finally, by manually refining the generated plans, we obtained data triplets of task, plan, and image.
  • Figure A2: Illustration of the planning of ReLEP on the Bring Water task in the real-world experiment using collected images. Steps that would not change the environment image are omitted during testing.
  • Figure A3: An outline of ReLEP's system prompt.
  • ...and 6 more figures
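
The planning loop described in the Figure 1 caption (plan, execute the first step, save to memory, replan) can be sketched as follows. This is a minimal illustration under stated assumptions: the `planner` callable stands in for the fine-tuned vision-language model and is assumed to already incorporate the skill library and robot configuration, the `Memory` fields and function names are hypothetical, and stopping when the planner returns an empty plan is an assumed termination rule that the caption does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    """Stores executed steps and past plans for later planning rounds."""
    executed_steps: List[str] = field(default_factory=list)
    past_plans: List[List[str]] = field(default_factory=list)


# Hypothetical planner signature: (instruction, scene image, memory) -> plan.
Planner = Callable[[str, str, Memory], List[str]]


def run_episode(
    instruction: str,
    get_image: Callable[[], str],
    planner: Planner,
    execute: Callable[[str], None],
    max_rounds: int = 20,
) -> Memory:
    """Replan every round, but execute only the first step of each plan."""
    memory = Memory()
    for _ in range(max_rounds):
        plan = planner(instruction, get_image(), memory)
        if not plan:  # assumed stop condition: nothing left to do
            break
        memory.past_plans.append(plan)
        first_step = plan[0]
        execute(first_step)
        memory.executed_steps.append(first_step)
    return memory


# Toy stand-in for the model: returns the part of a fixed script not yet executed.
script = ["move_to(sink)", "pick_up(cup)", "put_down(cup, table)"]


def toy_planner(instruction: str, image: str, memory: Memory) -> List[str]:
    return script[len(memory.executed_steps):]


log: List[str] = []
result = run_episode("bring the cup", lambda: "image-placeholder", toy_planner, log.append)
```

Executing only the first step and then replanning from a fresh image lets each round react to the actual scene, while the memory keeps the planner from repeating steps that have already been carried out.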