LERa: Replanning with Visual Feedback in Instruction Following

Svyatoslav Pchelintsev, Maxim Patratskiy, Anatoly Onishchenko, Alexandr Korchemnyi, Aleksandr Medvedev, Uliana Vinogradova, Ilya Galuzinsky, Aleksey Postnikov, Alexey K. Kovalev, Aleksandr I. Panov

TL;DR

This work targets the brittleness of LLM-driven robotic task planning under dynamic changes and execution failures. It introduces LERa—Look, Explain, Replan—a VLM-based replanner that relies on a single RGB image $O_t$, instruction $I$, initial plan $P$, and a failure signal $E_t$ to produce a revised plan $P'$ without requiring object detections or preconditions. Across ALFRED-ChaOS, VirtualHome-ChaOS, TableTop PyBullet, and real-robot experiments, LERa significantly improves success rates (e.g., up to 94% SR in VirtualHome-ChaOS and up to 67% gains in PyBullet) and demonstrates robustness to imperfect error checking. Ablations and VLM-variant analyses reveal the necessity of the three-step Look–Explain–Replan process and highlight how VLM quality and checker reliability affect performance. The inclusion of ALFRED-ChaOS and VirtualHome-ChaOS provides practical benchmarks, and real-world robot trials validate LERa’s applicability to real tasks, making it a robust, adaptable solution for error-aware robotic task execution.

Abstract

Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real-world changes and failures. To address these challenges, we propose LERa - Look, Explain, Replan - a Visual Language Model-based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection - without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look - where LERa generates a scene description and identifies errors; (ii) Explain - where it provides corrective guidance; and (iii) Replan - where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors from both dynamic scene changes and task execution failures. We evaluate LERa on the newly introduced ALFRED-ChaOS and VirtualHome-ChaOS datasets, achieving a 40% improvement over baselines in dynamic environments. In tabletop manipulation tasks with a predefined probability of task failure within the PyBullet simulator, LERa improves success rates by up to 67%. Further experiments, including real-world trials with a tabletop manipulator robot, confirm LERa's effectiveness in replanning. We demonstrate that LERa is a robust and adaptable solution for error-aware task execution in robotics. The project page is available at https://lera-robo.github.io.
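The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `vlm` callable, the `[LOOK]`/`[EXPLAIN]`/`[REPLAN]` prompt tags, the prompt wording, the `failed_step` form of the failure signal, and the choice to pass the image only to the Look step are all assumptions made for illustration. `fake_vlm` is a canned stand-in so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical VLM interface: (prompt text, optional raw RGB image bytes) -> reply text.
VLM = Callable[[str, Optional[bytes]], str]


@dataclass
class ReplanResult:
    description: str      # output of Look
    guidance: str         # output of Explain
    new_plan: List[str]   # output of Replan


def lera_replan(vlm: VLM, image: bytes, instruction: str,
                plan: List[str], failed_step: str) -> ReplanResult:
    """Run Look, Explain, Replan in sequence (illustrative prompts only)."""
    plan_text = "; ".join(plan)
    context = (f"Instruction: {instruction}\n"
               f"Plan: {plan_text}\n"
               f"Failed step: {failed_step}\n")

    # Look: describe the scene and identify the error (assumed to be the only step that sees the image).
    description = vlm("[LOOK] " + context +
                      "Describe the scene and identify the error.", image)

    # Explain: turn the description into corrective guidance.
    guidance = vlm("[EXPLAIN] " + context + f"Scene: {description}\n"
                   "Suggest how to correct the error.", None)

    # Replan: rewrite the plan according to the guidance.
    reply = vlm("[REPLAN] " + context + f"Guidance: {guidance}\n"
                "Write the revised plan, one step per line.", None)
    new_plan = [line.strip() for line in reply.splitlines() if line.strip()]
    return ReplanResult(description, guidance, new_plan)


def fake_vlm(prompt: str, image: Optional[bytes] = None) -> str:
    """Canned replies standing in for a real VLM (hypothetical)."""
    if prompt.startswith("[LOOK]"):
        return "The mug is on the floor, not in the gripper."
    if prompt.startswith("[EXPLAIN]"):
        return "Pick up the mug again before placing it."
    return "pick up mug\nplace mug on table\n"


result = lera_replan(fake_vlm, b"", "put the mug on the table",
                     ["place mug on table"], "place mug on table")
print(result.new_plan)
```

Keeping the three calls separate means each intermediate output (scene description, guidance) can be inspected or ablated on its own.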

Paper Structure

This paper contains 21 sections, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: An approach without error handling can be enhanced by the LERa module to dynamically update the plan (a queue of tasks) when errors occur during plan execution. On the left (solid arrows), the plan is generated once at the beginning of an episode and remains unchanged throughout. With LERa (dashed arrows), a Visual Language Model is used to replan based on visual feedback whenever an error occurs.
  • Figure 2: A step-by-step example of replanning using the LERa module. It consists of three main steps: Look, Explain, and Replan. Each step solves its own problem in order to modify the current plan $P$ to a new plan $P'$.
  • Figure 3: The assumed structure of an agent that uses LERa consists of three modules: (1) Task Planner, (2) Task Executor, and (3) Task Checker. Blue is for language, yellow is for planning and execution, red is for self-checking, and green is for replanning.
  • Figure 4: Prompt templates used by the LERa module.
  • Figure 5: We evaluate LERa across a diverse set of environments. From left to right, visual observations from ALFRED, VirtualHome, TableTop PyBullet, and TableTop Robotic Stand.
  • ...and 1 more figures
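
The captions of Figures 1 and 3 describe a plan held as a queue of tasks, executed step by step, checked, and replanned when an error occurs. A minimal sketch of such a loop follows, assuming hypothetical `execute`, `check`, and `replan` callables and a `max_replans` cap that the source does not specify. `ToyWorld` and `toy_replan` are invented stand-ins for the executor, checker, and LERa module.

```python
from collections import deque
from typing import Callable, List, Tuple


def run_episode(plan: List[str],
                execute: Callable[[str], None],
                check: Callable[[str], bool],
                replan: Callable[[List[str], str], List[str]],
                max_replans: int = 3) -> Tuple[bool, List[str]]:
    """Execute a task queue, calling the replanner whenever the checker reports failure."""
    queue = deque(plan)
    done: List[str] = []
    replans = 0
    while queue:
        task = queue[0]
        execute(task)
        if check(task):
            done.append(queue.popleft())   # task succeeded, move on
            continue
        if replans >= max_replans:         # give up after too many replans (assumed cap)
            return False, done
        # Replace the remaining plan with the replanner's output.
        queue = deque(replan(list(queue), task))
        replans += 1
    return True, done


class ToyWorld:
    """Toy executor and checker: 'place X' only succeeds if X is being held."""

    def __init__(self) -> None:
        self.holding = set()
        self.placed = set()
        self.last_ok = True

    def execute(self, task: str) -> None:
        verb, obj = task.split()
        if verb == "pick":
            self.holding.add(obj)
            self.last_ok = True
        elif verb == "place" and obj in self.holding:
            self.holding.discard(obj)
            self.placed.add(obj)
            self.last_ok = True
        else:
            self.last_ok = False

    def check(self, task: str) -> bool:
        return self.last_ok


def toy_replan(remaining: List[str], failed: str) -> List[str]:
    """Stand-in replanner: re-grasp the object before retrying the failed step."""
    return ["pick " + failed.split()[1]] + remaining


world = ToyWorld()
success, executed = run_episode(["place mug"], world.execute, world.check, toy_replan)
print(success, executed)
```

Without a replanner the single `place mug` step would fail and the plan would stay unchanged. With one, the queue is rewritten after the failure and execution resumes from the new head.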