Remote Life Support Robot Interface System for Global Task Planning and Local Action Expansion Using Foundation Models

Yoshiki Obinata, Haoyu Jia, Kento Kawaharazuka, Naoaki Kanazawa, Kei Okada

TL;DR

This study implements two mechanisms driven by template variables: prompt generation for each robot action function, which lets the robot collect information on-site, and a feedback system that presents options to the user and receives the user's selection for user-to-robot communication.

Abstract

Robot systems capable of executing tasks based on language instructions have been actively researched. However, a single language instruction cannot easily convey uncertain information that can only be determined on-site. In this study, we propose a system in which the ambiguous parts of a language instruction are written as template variables, which tell the robot what information to collect and what options to present when predictable uncertain events occur. We implement prompt generation for each robot action function based on the template variables, used to collect information, and a feedback system for presenting and selecting options based on the template variables, used for user-to-robot communication. The effectiveness of the proposed system was demonstrated by applying it to real-life support tasks performed by the robot.

Paper Structure

This paper contains 11 sections, 9 figures.

Figures (9)

  • Figure 1: Problem solved in this research (top) and the proposed system (bottom). The user requests the robot to buy food and bring drinks, but the robot does not know what to look for on-site, what the user wants, or where to deliver the items. In this research, we propose using natural language instructions with template variables surrounded by @ to specify the information that needs to be collected and the queries that need to be made when the user instructs the robot. When executing action functions containing template variables, the robot sets up recognizers (Vision Language Model prompts) according to the content of the template variables, presents options to the user's operation device, and continues task actions by receiving feedback from these options and the operation UI.
  • Figure 2: Overview of the proposed system. The system has two stages: a generation stage, in which the user communicates a task to the robot and a task script is generated and saved, and an execution stage, in which task scripts are run. In the generation stage, the user sends the robot a text instruction containing template variables surrounded by @. The robot inputs the instruction into the LLM along with prompts describing the robot's functions and presents the decomposed robot action sequence to the user for confirmation. Once the user approves, the action sequence is input into Dialogflow for conversion into an EusLisp script, which is then registered in the database. In the execution stage, when the user communicates a task instruction to the robot, the robot searches for similar tasks registered in the past, loads the matching script, and executes the EusLisp script to perform the task. If the robot encounters template variables surrounded by @ during task execution, it either generates appropriate VLM prompts to obtain information from the VLM or queries the user via chat buttons or AR devices, and then continues the operation.
  • Figure 3: System for the robot to collect information about arguments surrounded by templates on-site. Prompt templates for input to the VLM corresponding to the functions are prepared in advance, and the user's input is substituted to create the VLM's language input. The image input to the VLM is provided by the high-resolution camera Azure Kinect mounted on the robot.
  • Figure 4: AR system for presenting template variable options from the robot to the user and for sending object poses from the user to the robot. The AR glasses menu UI displays a list of the objects detected by the robot. When the user selects an object from the menu UI and places it around themselves, as shown in Fig. \ref{figure:ar-ui}, the pose is sent to the robot. The AR device is a Microsoft HoloLens 2, the runtime is the Unity OpenXR Plugin, the UI is built with Microsoft's Mixed Reality Toolkit 2, and communication between the AR device and ROS uses ROS TCP Connector / Endpoint.
  • Figure 5: The user manipulates objects in AR, and the information is transmitted to the robot. When the user places a drink from the menu on the table, the type and pose of the drink are transmitted to the robot.
  • ...and 4 more figures
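
The template-variable mechanism described in the captions for Figures 1 to 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation, which runs as EusLisp scripts on the robot. The regular expression, the function names (`find`, `deliver`), the prompt wording, and the comma-separated answer format are all assumptions made for illustration. The sketch only shows the general idea: extract the text surrounded by @ from an instruction, substitute it into a prompt template prepared in advance for an action function, and turn a VLM answer into options that could be presented to the user.

```python
import re
from typing import List

# Assumption: a template variable is any non-empty text between two @ signs.
TEMPLATE_VAR = re.compile(r"@([^@]+)@")

# Hypothetical prompt templates, one per robot action function.
# The function names and wording are illustrative, not taken from the paper.
VLM_PROMPT_TEMPLATES = {
    "find": "List the {var} visible in the image as a comma-separated list.",
    "deliver": "Describe where {var} is located in the image.",
}


def extract_template_vars(instruction: str) -> List[str]:
    """Return the text of every @...@ template variable, in order of appearance."""
    return [m.strip() for m in TEMPLATE_VAR.findall(instruction)]


def build_vlm_prompt(function_name: str, variable: str) -> str:
    """Substitute a template variable into the prompt template of one action function."""
    return VLM_PROMPT_TEMPLATES[function_name].format(var=variable)


def parse_options(vlm_answer: str) -> List[str]:
    """Split a comma-separated VLM answer into options, dropping empty entries."""
    return [o.strip() for o in vlm_answer.split(",") if o.strip()]


if __name__ == "__main__":
    instruction = "Go to the kitchen and bring me @drinks available there@."
    for var in extract_template_vars(instruction):
        print(build_vlm_prompt("find", var))
    # A real system would send the prompt and a camera image to a VLM here;
    # the answer below is a hard-coded placeholder.
    print(parse_options("tea, coffee, water"))
```

Keeping one prompt template per action function means the same template variable can produce different recognizer prompts depending on where it appears in the action sequence, which matches the per-function prompt generation the captions describe.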