RePLan: Robotic Replanning with Perception and Language Models

Marta Skreta; Zihan Zhou; Jia Lin Yuan; Kourosh Darvish; Alán Aspuru-Guzik; Animesh Garg

TL;DR

RePLan presents a hierarchical, perception-grounded framework for online replanning in long-horizon robotic tasks by integrating a high-level LLM planner, a Vision-Language Model perceiver, a low-level reward translator, a MuJoCo-based motion controller, and an LLM/VLM verifier. It introduces the RC Benchmark to evaluate open-ended, multi-step planning with perception feedback and replanning. Empirical results show RePLan achieving roughly four times better success than a language-to-reward baseline and demonstrating real-robot applicability, with a notable dependence on perceptual grounding. The work highlights the critical role of a multi-stage verifier in improving robustness and discusses limitations related to VLM spatial reasoning and perception reliability, pointing to future improvements in vision-grounded reasoning for robotics.

Abstract

Advancements in large language models (LLMs) have demonstrated their potential in facilitating high-level reasoning, logical reasoning and robotics planning. Recently, LLMs have also been able to generate reward functions for low-level robot actions, effectively bridging the interface between high-level planning and low-level robot control. However, even with syntactically correct plans, robots can still fail to achieve their intended goals due to imperfect plans or unexpected environmental issues. Vision Language Models (VLMs), which have shown remarkable success in tasks such as visual question answering, offer a way to overcome this. Leveraging the capabilities of VLMs, we present a novel framework called Robotic Replanning with Perception and Language Models (RePLan) that enables online replanning capabilities for long-horizon tasks. This framework utilizes the physical grounding provided by a VLM's understanding of the world's state to adapt robot actions when the initial plan fails to achieve the desired goal. We developed a Reasoning and Control (RC) benchmark with eight long-horizon tasks to test our approach. We find that RePLan enables a robot to successfully adapt to unforeseen obstacles while accomplishing open-ended, long-horizon goals where baseline models cannot, and that it can be readily applied to real robots. More information is available at https://replan-lm.github.io/replan.github.io/

Paper Structure (44 sections, 1 equation, 30 figures, 4 tables)

Introduction
Related work
RePLan: Model Architecture
High-Level LLM Planner
VLM Perceiver
Low-Level LLM Planner
Motion Controller
LLM & VLM Verifier
Reasoning & Control (RC) Benchmark
Experimental Evaluation
Experiment Setup
Baselines and Ablations
Experiment 1: Motion Controller
Experiment 2: Long-horizon task completion
RePLan with Real-Robot Environment
...and 29 more sections

Figures (30)

Figure 1: Overview of RePLan using an example user goal: "place apple in the bowl". RePLan generates a high-level plan of a possible solution, followed by low-level reward functions. If it is unable to accomplish a subtask, perception is used to diagnose any issues that may be present in the scene. In this instance, there is a lemon already in the bowl, which the robot must first remove. RePLan generates a plan to remove the obstacle and continue with the task. If the original goal is still not accomplished after all the subtasks are completed, RePLan can reason at a higher level to try a new solution. By contrast, prompting vanilla GPT-4V with an image of the scene and the user goal returns an infeasible plan because it does not take object states into account.
Figure 2: RePLan model architecture for hierarchical reasoning and control. In the Task Control level, RePLan first generates high-level subtasks conditioned on the user's goal and a scene image. Next, the Subtask Control level determines what action types need to be taken to execute the subtask. If the action requires obtaining information about object attributes or states, the VLM Perceiver may be called to answer any questions. VLM answers are reasoned upon and used to update the world knowledge of the Planners. Otherwise, if the subtasks require low-level actions, a low-level motion plan is generated and passed to the Low-Level Planner at the Skill Controls level. The Planner generates robot skill-level rewards for execution via MPC. If failures occur, feedback is provided to subtask and task controls for replanning. If the goal is not reached even after replanning and completing the subtasks, the High-Level Planner reflects on past experiences and proposes a new plan. A more detailed version is shown in the Appendix (figure fig:roadmap).
Figure 3: Number of actions the robot executed in each task averaged over ten runs. Actions requiring the Perceiver are shown in pink while those executed using MPC are shown in purple. Standard deviations are shown using gray bars while the minimum and maximum number of actions are shown using gray dots.
Figure 4: Rollout of the robot solving Kitchen-Explore. The high-level plan is shown in the top row. The second row shows each subtask and the corresponding reward functions generated by the Low-Level Planner, as well as Perceiver feedback. If the subtask fails, its box is colored in red. If the plan is completed and the goal is achieved, its box is green.
Figure 5: Real-world experiment. The robot is tasked with placing an apple inside the bowl, but it has to figure out that the lemon must first be removed in order to complete the task. The full experiment trajectory can be seen in the video in Supplementary Material.
...and 25 more figures
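
The replanning behaviour described in the captions of Figures 1 and 2 can be illustrated with a small Python sketch. This is not the authors' implementation: the function `run_replan_loop`, its callback parameters (`plan`, `execute`, `diagnose`, `repair`, `goal_reached`), the retry limits, and the toy dictionary world are all hypothetical stand-ins for the High-Level Planner, the MPC-based controller, the VLM Perceiver, and the Verifier. It only shows the control flow: try a subtask, diagnose a failure through perception, insert corrective subtasks, and fall back to a fresh high-level plan if the goal is still unmet.

```python
from collections import deque
from typing import Callable, List, Tuple


def run_replan_loop(
    goal: str,
    plan: Callable[[str, List[str]], List[str]],   # stand-in for the high-level planner
    execute: Callable[[str], bool],                 # stand-in for low-level planner + controller
    diagnose: Callable[[str], str],                 # stand-in for the VLM Perceiver
    repair: Callable[[str, str], List[str]],        # stand-in for subtask-level replanning
    goal_reached: Callable[[], bool],               # stand-in for the final goal check
    max_plans: int = 2,                             # assumed limit, not from the paper
    max_repairs: int = 3,                           # assumed limit, not from the paper
) -> Tuple[bool, List[str]]:
    """Run subtasks, replanning on failure. Returns (success, event log)."""
    log: List[str] = []
    for _ in range(max_plans):
        # Each new high-level plan may condition on the history so far.
        queue = deque(plan(goal, log))
        repairs = 0
        while queue:
            subtask = queue.popleft()
            if execute(subtask):
                log.append(f"ok: {subtask}")
                continue
            log.append(f"failed: {subtask}")
            if repairs >= max_repairs:
                break  # give up on this plan and reflect at the task level
            repairs += 1
            cause = diagnose(subtask)
            log.append(f"perceived: {cause}")
            # Retry the failed subtask after the corrective subtasks.
            queue.appendleft(subtask)
            queue.extendleft(reversed(repair(subtask, cause)))
        if goal_reached():
            return True, log
    return False, log


# Toy world mirroring the apple-and-lemon example: a lemon blocks the bowl.
world = {"bowl": "lemon"}


def toy_plan(goal: str, history: List[str]) -> List[str]:
    return ["place apple in bowl"]


def toy_execute(subtask: str) -> bool:
    if subtask == "remove lemon from bowl":
        world["bowl"] = None
        return True
    if subtask == "place apple in bowl" and world["bowl"] is None:
        world["bowl"] = "apple"
        return True
    return False


def toy_diagnose(subtask: str) -> str:
    return f"{world['bowl']} is in the bowl"


def toy_repair(subtask: str, cause: str) -> List[str]:
    return ["remove lemon from bowl"]


def toy_goal_reached() -> bool:
    return world["bowl"] == "apple"


success, log = run_replan_loop(
    "place apple in the bowl",
    toy_plan, toy_execute, toy_diagnose, toy_repair, toy_goal_reached,
)
for entry in log:
    print(entry)
```

In the toy run, the first attempt to place the apple fails, the stand-in perceiver reports the lemon, a removal subtask is inserted ahead of the retry, and the goal check then passes. In the framework described above, each callback would presumably be backed by an LLM, a VLM, or the motion controller rather than a hard-coded rule.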
