D-RMGPT: Robot-assisted collaborative tasks driven by large multimodal models

M. Forlini; M. Babcinschi; G. Palmieri; P. Neto

TL;DR

D-RMGPT is a robot-assisted, markerless assembly planner that uses large multimodal models to help inexperienced operators complete assembly tasks with a collaborative robot. The system combines DetGPT-V for one-shot perception and R-ManGPT for planning and discrete robot actions, built on GPT-4V and GPT-4 respectively, to detect the current assembly status and determine the next component. Experiments on a toy aircraft show an 83% assembly success rate and a 33% reduction in assembly time for inexperienced operators, with DetGPT-V outperforming traditional VLM detectors in component recognition. The work demonstrates the potential of open, general LMM-based perception and reasoning to create intuitive HRI interfaces that adapt to user choices and uncertain environments.

Abstract

Collaborative robots are increasingly popular for assisting humans in work and daily tasks. However, designing and setting up interfaces for human-robot collaboration is challenging, requiring the integration of multiple components, from perception and robot task control to the hardware itself. Frequently, this leads to highly customized solutions that rely on large amounts of costly training data, diverging from the ideal of flexible and general interfaces that empower robots to perceive and adapt to unstructured environments where they can naturally collaborate with humans. To overcome these challenges, this paper presents the Detection-Robot Management GPT (D-RMGPT), a robot-assisted assembly planner based on Large Multimodal Models (LMMs). This system can assist inexperienced operators in assembly tasks without requiring any markers or previous training. D-RMGPT is composed of DetGPT-V and R-ManGPT. DetGPT-V, based on GPT-4V(ision), perceives the surrounding environment through one-shot analysis of prompted images of the current assembly stage and the list of components to be assembled. It identifies which components have already been assembled by analysing their features and assembly requirements. R-ManGPT, based on GPT-4, plans the next component to be assembled and generates the robot's discrete actions to deliver it to the human co-worker. Experimental tests on assembling a toy aircraft demonstrated that D-RMGPT is flexible and intuitive to use, achieving an assembly success rate of 83% while reducing the assembly time for inexperienced operators by 33% compared to the manual process. http://robotics-and-ai.github.io/LMMmodels/
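
The planning step described in the abstract (choose the next component given what has already been assembled and the precedence relationships in the component list) can be illustrated with a small sketch. This is not the paper's method: R-ManGPT makes this choice by prompting GPT-4, whereas the code below is a deterministic stand-in that only shows the constraint being respected. The component names, the `PRECEDENCE` table, and both function names are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical component list: each component maps to the components
# that must already be assembled before it (its precedence constraints).
PRECEDENCE: Dict[str, List[str]] = {
    "chassis": [],
    "wing": ["chassis"],
    "tail": ["chassis"],
    "propeller": ["wing"],
}


def eligible_components(precedence: Dict[str, List[str]],
                        assembled: Set[str]) -> List[str]:
    """Return components not yet assembled whose prerequisites are all met."""
    return [
        name
        for name, prereqs in precedence.items()
        if name not in assembled and all(p in assembled for p in prereqs)
    ]


def next_component(precedence: Dict[str, List[str]],
                   assembled: Set[str]) -> Optional[str]:
    """Pick the first eligible component, or None when nothing is left.

    `assembled` stands in for the detection module's output, i.e. the set
    of components reported as already mounted in the current image.
    """
    candidates = eligible_components(precedence, assembled)
    return candidates[0] if candidates else None
```

Because the choice is recomputed from the detected state at every step, a sketch like this still proposes a valid component when the operator assembles parts in an unexpected order, for example mounting the tail before the wing.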

Paper Structure (13 sections, 9 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Detection-Robot Management GPT (D-RMGPT)
Problem Statement and Assembly Process
Proposed Approach
Prompt Structure
Experiments and Evaluation
System Setup
Evaluation
Baseline Comparison
Results and Discussion
Conclusion and Future Work
Acknowledgements

Figures (9)

Figure 1: The D-RMGPT architecture comprises the detection module DetGPT-V and the robot management and planner module R-ManGPT. These modules, based on GPT-4V(ision) and GPT-4, enable an inexperienced operator to successfully complete an assembly task assisted by a collaborative robot.
Figure 2: Component list image $X_3$. It includes each component's picture, number, description, and assembly precedence relationships, i.e., the components that must be assembled before the component in question.
Figure 3: Prompt structure for the detection module DetGPT-V.
Figure 4: Prompt structure for the robot management and planner module R-ManGPT.
Figure 5: Assembly step of the toy aircraft chassis assisted by D-RMGPT.
...and 4 more figures
