Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration

Haokun Liu; Yaonan Zhu; Kenji Kato; Atsushi Tsukahara; Izumi Kondo; Tadayoshi Aoyama; Yasuhisa Hasegawa

TL;DR

This work tackles the challenge of extending LLM-driven robot manipulation beyond simple tasks by integrating a GPT-4 Turbo–based hierarchical planner with real-time environmental perception and a human-in-the-loop HRC framework. It introduces a dual-library system (basic motion functions and DMP-based trajectories) and leverages teleoperation to capture demonstrations that augment the robot's repertoire, enabling efficient handling of long-horizon tasks. Real-world experiments on a Toyota HSR show high executability (0.994) and feasibility (0.975) with an overall success rate of 0.795, and demonstrate that human guidance materially improves the execution of previously infeasible tasks like opening certain doors. The results underscore the practical potential of combining LLM planning, visual perception, and DMP-based learning to achieve robust, adaptable manipulation in complex environments, while highlighting future opportunities to enhance sensing with LIDAR and tactile feedback.

Abstract

Large Language Models (LLMs) are gaining popularity in the field of robotics. However, LLM-based robots are limited to simple, repetitive motions due to the poor integration between language models, robots, and the environment. This paper proposes a novel approach to enhance the performance of LLM-based autonomous manipulation through Human-Robot Collaboration (HRC). The approach involves using a prompted GPT-4 language model to decompose high-level language commands into sequences of motions that can be executed by the robot. The system also employs a YOLO-based perception algorithm, providing visual cues to the LLM, which aids in planning feasible motions within the specific environment. Additionally, an HRC method is proposed by combining teleoperation and Dynamic Movement Primitives (DMP), allowing the LLM-based robot to learn from human guidance. Real-world experiments have been conducted using the Toyota Human Support Robot for manipulation tasks. The outcomes indicate that tasks requiring complex trajectory planning and reasoning over environments can be efficiently accomplished through the incorporation of human demonstrations.
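The abstract names Dynamic Movement Primitives (DMP) as the representation that lets the robot learn a trajectory from a teleoperated demonstration. As a rough illustration only, the sketch below implements the standard one-dimensional discrete DMP from the general literature in plain Python; it is not the authors' implementation. The class name `DMP1D`, the gains, the number of basis functions, and the minimum-jerk demonstration are all assumptions made for this example.

```python
import math


def _diff(seq, dt):
    """Finite-difference derivative (central inside, one-sided at the ends)."""
    n = len(seq)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((seq[hi] - seq[lo]) / ((hi - lo) * dt))
    return out


class DMP1D:
    """Textbook 1-D discrete DMP (hypothetical helper, not the paper's code)."""

    def __init__(self, n_basis=20, alpha_z=25.0, alpha_x=4.0):
        self.alpha_z = alpha_z
        self.beta_z = alpha_z / 4.0  # critically damped spring-damper
        self.alpha_x = alpha_x
        self.tau = 1.0
        # Centers are evenly spaced in time, so exponentially spaced in phase x.
        self.centers = [math.exp(-alpha_x * i / (n_basis - 1)) for i in range(n_basis)]
        gaps = [self.centers[i + 1] - self.centers[i] for i in range(n_basis - 1)]
        gaps.append(gaps[-1])
        self.widths = [1.0 / (gap * gap) for gap in gaps]
        self.weights = [0.0] * n_basis

    def _basis(self, x):
        return [math.exp(-h * (x - c) ** 2) for h, c in zip(self.widths, self.centers)]

    def fit(self, demo, dt):
        """Learn forcing-term weights from one demonstration (locally weighted regression)."""
        n = len(demo)
        tau = (n - 1) * dt
        y0, g = demo[0], demo[-1]
        vel = _diff(demo, dt)
        acc = _diff(vel, dt)
        num = [0.0] * len(self.weights)
        den = [1e-10] * len(self.weights)
        for i in range(n):
            x = math.exp(-self.alpha_x * i * dt / tau)  # phase at time i*dt
            s = x * (g - y0)                            # scaling of the forcing term
            f_target = tau ** 2 * acc[i] - self.alpha_z * (
                self.beta_z * (g - demo[i]) - tau * vel[i]
            )
            for k, psi in enumerate(self._basis(x)):
                num[k] += psi * s * f_target
                den[k] += psi * s * s
        self.weights = [a / b for a, b in zip(num, den)]
        self.tau = tau

    def rollout(self, y0, g, dt, tau=None):
        """Integrate the DMP with explicit Euler steps toward goal g."""
        tau = self.tau if tau is None else tau
        y, z, x = y0, 0.0, 1.0
        traj = [y]
        for _ in range(int(round(tau / dt))):
            psi = self._basis(x)
            f = sum(p * w for p, w in zip(psi, self.weights)) / (sum(psi) + 1e-10)
            f *= x * (g - y0)
            dz = (self.alpha_z * (self.beta_z * (g - y) - z) + f) / tau
            dy = z / tau
            dx = -self.alpha_x * x / tau
            y, z, x = y + dy * dt, z + dz * dt, x + dx * dt
            traj.append(y)
        return traj


# Assumed demonstration: a minimum-jerk reach from 0 to 1 over one second.
dt = 0.01
demo = [10 * t ** 3 - 15 * t ** 4 + 6 * t ** 5 for t in (i * dt for i in range(101))]
dmp = DMP1D()
dmp.fit(demo, dt)
traj = dmp.rollout(y0=0.0, g=2.0, dt=dt)  # reuse the learned shape with a new goal
```

Fitting once and then rolling out toward a different goal illustrates why a DMP suits a reusable motion library: the demonstrated shape is kept while the end point changes.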

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, 2 tables, and 2 algorithms.

Introduction
Related Works
Robotics Manipulation with Natural Language
VR-based Teleoperation System
Dynamic Movement Primitives for Trajectory Learning
Methods
Large Language Model for Autonomous Robot Manipulation
Environmental Perception
Human-Robot Collaboration
Experiments and Discussion
Zero-shot Basic Tasks
Put&Stack
Open
Close
Power on
...and 9 more sections

Figures (6)

Figure 1: An overview of an LLM-based Human-Robot Collaboration System, featuring user interaction, a basic library for pre-programmed motion functions, and a DMP library for adaptive motion function generation and storage to accomplish a complex real-world task (e.g., "warm up my lunch").
Figure 2: Illustration of how the LLM understands, classifies, decomposes, and executes different tasks.
Figure 3: An overview of LLM-based autonomy for a real-world long-horizon task. This process encompasses sub-task identification, motion function selection from the basic library, environment perception integration, and executable code generation.
Figure 4: An overview of the LLM-based autonomy with Human-Robot Collaboration in a sub-task (short-horizon task). The LLM processes user input to select motion functions from the basic library. The selected motions are then modified by the user through teleoperation via the user interface. The updated motion functions are stored in the DMP library under a specific name such as "open_oven_handle" (the LLM captures the action "open" and the target "oven_handle", then combines them as "open_oven_handle") for future use, either when the same task is input again or when the motion is reused in a long-horizon task, resulting in successful one-shot task execution.
Figure 5: An overview of how the user interface is used to supervise and intervene in a robot’s task sequence, including the communication between the user and the system and the steps for modifying robotic motions through teleoperation.
...and 1 more figure
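The Figure 4 caption describes storing a teleoperated motion under a name built from the action and the target, such as "open_oven_handle", so that it can be reused later. A minimal sketch of that bookkeeping follows. The names `motion_name` and `MotionLibraries`, the dictionary layout, and the lookup order (DMP library first, then the basic library) are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


def motion_name(action: str, target: str) -> str:
    """Join an action and a target into a key such as 'open_oven_handle'."""
    parts = (action.strip().lower(), target.strip().lower())
    return "_".join(p.replace(" ", "_") for p in parts)


@dataclass
class MotionLibraries:
    """Hypothetical container for the two libraries named in the captions."""

    basic: dict = field(default_factory=dict)  # action -> pre-programmed motion
    dmp: dict = field(default_factory=dict)    # "<action>_<target>" -> demonstrated trajectory

    def store_demonstration(self, action, target, trajectory):
        """Save a human-corrected trajectory under its action_target name."""
        name = motion_name(action, target)
        self.dmp[name] = list(trajectory)
        return name

    def lookup(self, action, target):
        """Assumed policy: prefer a stored demonstration, else the generic motion."""
        name = motion_name(action, target)
        if name in self.dmp:
            return "dmp", self.dmp[name]
        key = action.strip().lower()
        if key in self.basic:
            return "basic", self.basic[key]
        return None


libs = MotionLibraries(basic={"open": "open_generic"})
first = libs.lookup("open", "oven_handle")        # only the generic motion exists
stored = libs.store_demonstration("open", "oven handle", [0.0, 0.1, 0.3])
second = libs.lookup("open", "oven_handle")       # the demonstration is now reused
```

Keying the DMP library by action and target keeps a correction made for one object (the oven handle) from overriding the generic motion used for other targets.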
