Enabling robots to follow abstract instructions and complete complex dynamic tasks

Ruaridh Mon-Williams, Gen Li, Ran Long, Wenqian Du, Chris Lucas

TL;DR

This paper addresses the challenge of flexible, robust execution of high-level human instructions by robots in unstructured homes. It proposes a framework that integrates Large Language Models with a curated function knowledge base and integrated force and visual feedback (IFVF), using Retrieval-Augmented Generation to generate task-specific code that orchestrates manipulation actions. Key contributions include (1) transforming abstract goals into executable, context-aware policies via GPT-4 and RAG, (2) seamless integration of force and vision feedback to handle disturbances, and (3) a scalable setup demonstrated on coffee making, plate decoration, and related actions with a Kinova 7-DOF arm. The results show improved adaptability and accuracy in dynamic tasks, with a practical path toward scalable autonomous home robotics. The work also discusses limitations and future directions, such as proactive planning and more advanced dynamic models.

Abstract

Completing complex tasks in unpredictable settings like home kitchens challenges robotic systems. These challenges include interpreting high-level human commands, such as "make me a hot beverage", and performing actions like pouring a precise amount of water into a moving mug. To address these challenges, we present a novel framework that combines Large Language Models (LLMs), a curated Knowledge Base, and Integrated Force and Visual Feedback (IFVF). Our approach interprets abstract instructions, performs long-horizon tasks, and handles various uncertainties. It utilises GPT-4 to analyse the user's query and surroundings, then generates code that accesses a curated database of functions during execution. It translates abstract instructions into actionable steps. Each step involves generating custom code by employing retrieval-augmented generalisation to pull IFVF-relevant examples from the Knowledge Base. IFVF allows the robot to respond to noise and disturbances during execution. We use coffee making and plate decoration to demonstrate our approach, including components ranging from pouring to drawer opening, each benefiting from distinct feedback types and methods. This novel advancement marks significant progress toward a scalable, efficient robotic framework for completing complex tasks in uncertain environments. Our findings are illustrated in an accompanying video and supported by an open-source GitHub repository (released upon paper acceptance).
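
The retrieval step described in the abstract, pulling relevant examples from the Knowledge Base for each subtask, can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge-base entries, the function names, and the bag-of-words cosine similarity are all assumptions chosen to keep the example self-contained, and a real system would more likely use learned embeddings.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base. The entry names and descriptions are
# illustrative assumptions, not functions taken from the paper.
KNOWLEDGE_BASE = {
    "pour_with_force_feedback": "pour water into mug until target weight using force feedback",
    "open_drawer_compliant": "open drawer by pulling handle with force and torque feedback",
    "track_object_visual": "track moving mug pose using vision feedback",
}


def tokenize(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=1):
    """Return the names of the k entries most similar to the subtask query."""
    q = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda name: cosine(q, tokenize(KNOWLEDGE_BASE[name])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Each subtask produced by the language model would be used as a query.
    for subtask in ("pour water into the mug", "open the drawer"):
        print(subtask, "->", retrieve(subtask))
```

In a pipeline of the kind the abstract describes, the retrieved examples would then be placed in the language model's prompt so that the generated code reuses feedback-aware functions rather than being written from scratch.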

Paper Structure (1 section, 6 figures, 1 table)

This paper contains 1 section, 6 figures, 1 table.

Table of Contents

  1. Overview

Figures (6)

  • Figure 1: Coffee and plate decoration video. Kinova Gen3 Robot prepares coffee and decorates a plate. Click image for video demonstration.
  • Figure 2: Schematic of the system framework. The schematic shows the high-level (above the blue dashed horizontal line) and low-level (below the blue dashed horizontal line) system architecture. User queries are fed into a transformer via voice recognition software. The transformer (GPT-4) takes this input and integrates it with: (i) an image ($C$) of the environment (via an Azure Kinect depth camera); (ii) a knowledge base of code examples, including various functions stored in a database. The transformer decomposes the higher-order abstract task into actionable high-level subtasks, retrieves relevant code examples from the knowledge base, adapts them, and writes Python code tailored to these tasks. This code is then sent to the robot controller (A). The controller processes the code and sends control signals ($\lambda$) to the robot. The actions (a) are controlled with force (F) and vision (V) feedback. The model uses vision to identify the properties of different objects (e.g., the pose ($X$) of a coffee cup), so it can grasp objects accurately. The robot uses force ($f$) and torque ($\tau$) feedback (available via an ATI force transducer) to manipulate objects skillfully (e.g., determine how much water to pour). Feedback is necessary due to noise within the vision signal ($\eta_{vision}$), the robot joint angles ($\eta_{angle}$), and the force transducer signal ($\eta_{force}$). The feedback updates the motion in the Robot Operating System (ROS) to achieve the desired goal through linear ($v_{xyz}$) and angular ($v_{rpy}$) velocity commands. These commands generate trajectories based on appropriate forces and spatiotemporal patterns to achieve the sub-goals. The use of feedback loops, including 40 Hz updates of the end-effector position ($p$) and orientation ($q$), allows the robot to respond to disturbances (e.g., the robot tracking a cup to determine its new position after it is moved by the user).
  • Figure 3: Action shots of the Kinova Gen3 robot preparing coffee and decorating a plate.
  • Figure 4: Vision detection module. Illustration of the zero-shot vision detection module identifying a hand, white mug, and black kettle, and extracting target poses for robotic grasping.
  • Figure 5: Force ($N$), velocity ($\frac{m}{s}$), and position ($m$) plots during a robot's coffee preparation, illustrating diverse force feedback across different motions. Drawing was left out for clarity.
  • ...and 1 more figures
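
The force-feedback idea in the Figure 2 caption, using the measured force to determine how much water to pour while issuing angular velocity commands at 40 Hz, can be sketched as a small simulation. Everything beyond the 40 Hz rate is an assumption made for illustration: the linear flow model, the gains, the limits, the noise level, and the function name `pour_until` are hypothetical and are not taken from the paper.

```python
import random

G = 9.81         # gravitational acceleration, m/s^2
DT = 1.0 / 40.0  # 40 Hz control loop, matching the rate in the Figure 2 caption


def clip(value, low, high):
    """Limit value to the interval [low, high]."""
    return max(low, min(high, value))


def pour_until(target_kg, seed=0, max_steps=2400):
    """Simulate pouring until the sensed weight reaches target_kg.

    All constants below are illustrative assumptions, not values from the paper.
    Returns (poured mass in kg, number of control steps used).
    """
    rng = random.Random(seed)
    k_error = 5.0        # rad of reference tilt per kg of remaining water
    k_tilt = 4.0         # 1/s, gain on the tilt tracking error
    max_tilt = 0.8       # rad, largest allowed kettle tilt
    max_rate = 0.5       # rad/s, limit on the angular velocity command
    flow_per_rad = 0.05  # kg/s of flow per rad of tilt (toy flow model)
    noise_std = 0.01     # N, simulated force sensor noise

    tilt = 0.0
    poured_kg = 0.0
    for step in range(max_steps):
        # Simulated force reading: weight of the poured water plus noise.
        force_n = poured_kg * G + rng.gauss(0.0, noise_std)
        error_kg = target_kg - force_n / G

        # Stop once the target is nearly reached and the kettle is almost upright.
        if error_kg < 0.005 and tilt < 0.02:
            return poured_kg, step

        # Proportional law: tilt further while water is still missing.
        tilt_ref = clip(k_error * error_kg, 0.0, max_tilt)
        angular_velocity = clip(k_tilt * (tilt_ref - tilt), -max_rate, max_rate)

        # Integrate the toy kettle and flow models over one control period.
        tilt += angular_velocity * DT
        poured_kg += flow_per_rad * max(tilt, 0.0) * DT
    return poured_kg, max_steps


if __name__ == "__main__":
    mass, steps = pour_until(0.2)
    print(f"poured {mass:.3f} kg in {steps * DT:.1f} s")
```

Because the command is recomputed from the sensed force on every cycle, the loop slows the pour as the target approaches and tolerates sensor noise, which is the kind of behaviour the caption attributes to feedback.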