Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

Peiyuan Zhi, Zhiyuan Zhang, Yu Zhao, Muzhi Han, Zeyu Zhang, Zhitian Li, Ziyuan Jiao, Baoxiong Jia, Siyuan Huang

TL;DR

The paper tackles open-vocabulary mobile manipulation by introducing COME-robot, a closed-loop system that leverages GPT-4V for open-ended reasoning and code-based task plans. It couples a multi-level open-vocabulary perception and situated reasoning module with a hierarchical closed-loop feedback and restoration mechanism to detect, diagnose, and recover from failures during long-horizon tasks. Real-world experiments across eight OVMM tasks show substantial performance gains over a strong CaP*-based baseline and highlight the importance of robust failure recovery and inter-module tracing. The work demonstrates that integrating foundation-model reasoning with structured perception and robotics primitives enables flexible, long-horizon manipulation in complex environments.

Abstract

Autonomous robot navigation and manipulation in open environments require reasoning and replanning with closed-loop feedback. In this work, we present COME-robot, the first closed-loop robotic system utilizing the GPT-4V vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot incorporates two key innovative modules: (i) a multi-level open-vocabulary perception and situated reasoning module that enables effective exploration of the 3D environment and target object identification using commonsense knowledge and situated information, and (ii) an iterative closed-loop feedback and restoration mechanism that verifies task feasibility, monitors execution success, and traces failure causes across different modules for robust failure recovery. Through comprehensive experiments involving 8 challenging real-world mobile and tabletop manipulation tasks, COME-robot demonstrates a significant improvement in task success rate (~35%) compared to state-of-the-art methods. We further conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.

Paper Structure (19 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of COME-robot's system. Given a task instruction, COME-robot employs GPT-4V for reasoning and generates a code-based plan. Through feedback obtained from the robot's execution and interaction with the environment, it performs closed-loop replanning and iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.
  • Figure 2: COME-robot's planner has two key designs: Open-Vocabulary Perception and Reasoning, and Closed-Loop Feedback and Restoration. The former helps the robot ground open-ended instructions in the real environment, and the latter guarantees the task's completion. Actions to be executed as reasoned by GPT-4V are highlighted in blue, identified failures are highlighted in red, and analyses after observation or verification are highlighted in green.
  • Figure 3: A snapshot of COME-robot's system prompt.
  • Figure 4: A step-by-step visualization of COME-robot's task execution in Gather Cups. Given the query "Put the cups on the same table," the robot builds a global object map and locates two tables. It then navigates to table_0, explores locally, and identifies one cup on the table. It continues to inspect table_1 and identifies two cups. With situated commonsense reasoning, COME-robot decides to move the cup from table_0 to table_1, as this is more efficient. It thus navigates back to table_0, grasps the cup, and verifies the success of the grasp with the wrist camera. Finally, it navigates back to table_1 to place the cup down. With the placement once again verified, the task is considered complete.
  • Figure 5: Two examples of recovery from failures. Case 1 demonstrates recovery from a failed grasp attempt by adjusting the grasping position. Case 2 shows a false-positive detection and recovery through visual feedback.
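
The execute-verify-recover pattern described in the captions for Figures 1, 4, and 5 can be illustrated with a minimal sketch. This is not the authors' implementation: in COME-robot, GPT-4V reasons over feedback and rewrites the plan, whereas the sketch below substitutes a plain bounded retry. The names `StepResult`, `run_closed_loop`, and `fake_execute`, and the step strings, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    success: bool
    feedback: str  # hypothetical: a textual summary of what verification observed


def run_closed_loop(
    steps: List[str],
    execute: Callable[[str, int], StepResult],
    max_retries: int = 2,
) -> Tuple[bool, List[str]]:
    """Run each step, check its verification result, and retry on failure.

    A simple retry stands in for the GPT-4V replanning described in the paper.
    """
    log: List[str] = []
    for step in steps:
        for attempt in range(max_retries + 1):
            result = execute(step, attempt)
            log.append(f"{step} [attempt {attempt}]: {result.feedback}")
            if result.success:
                break
        else:
            # All attempts for this step failed, so the task is abandoned.
            return False, log
    return True, log


def fake_execute(step: str, attempt: int) -> StepResult:
    # Hypothetical stand-in for the robot: the first grasp attempt fails.
    if step == "grasp cup" and attempt == 0:
        return StepResult(False, "cup not in gripper")
    return StepResult(True, "verified")


ok, log = run_closed_loop(
    ["navigate to table_0", "grasp cup", "navigate to table_1", "place cup"],
    fake_execute,
)
for line in log:
    print(line)
print("task complete:", ok)
```

The log records one extra entry for the failed grasp, which mirrors Case 1 of Figure 5 only loosely: the real system adjusts the grasping position based on visual feedback rather than repeating the same action.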