Talk Through It: End User Directed Manipulation Learning

Carl Winge; Adam Imdieke; Bahaa Aldeeb; Dongyeop Kang; Karthik Desingh

TL;DR

This work introduces an end-user-directed hierarchical learning framework for robot manipulation. Capability is split between a Level-1 factory model of primitive actions and a home model that end users train on Level-2 skills and Level-3 tasks via natural language. Demonstrations for complex skills are collected through language rather than scripted data, enabling personalized task learning across 14 RLBench environments. The approach improves on baselines, with Level-2 and Level-3 gains of roughly 1.7x and 2.3x, respectively. The work also investigates using the Bard VLM to decompose tasks autonomously, and finds that the VLM handles high-level planning well but struggles with low-level grounded actions. The results point to the practical potential of user-driven customization in home robotics, while identifying current VLM limitations and the need for broader training environments and better grounding before robust real-world deployment.

Abstract

Training generalist robot agents is an immensely difficult feat due to the requirement to perform a huge range of tasks in many different environments. We propose selectively training robots based on end-user preferences instead. Given a factory model that lets an end user instruct a robot to perform lower-level actions (e.g. 'Move left'), we show that end users can collect demonstrations using language to train their home model for higher-level tasks specific to their needs (e.g. 'Open the top drawer and put the block inside'). We demonstrate this hierarchical robot learning framework on robot manipulation tasks using RLBench environments. Our method results in a 16% improvement in skill success rates compared to a baseline method. In further experiments, we explore the use of the large vision-language model (VLM), Bard, to automatically break down tasks into sequences of lower-level instructions, aiming to bypass end-user involvement. The VLM is unable to break tasks down to our lowest level, but does achieve good results breaking high-level tasks into mid-level skills. We have a supplemental video and additional results at talk-through-it.github.io.
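
As a rough illustration of the hierarchy described in the abstract, the sketch below records language-collected demonstrations as sequences of lower-level commands. It is bookkeeping only and does not reflect how the factory or home model is actually trained. The `Demonstration` class, the `expand` helper, and every command string other than 'move left' are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Demonstration:
    """A language-collected demonstration (hypothetical structure)."""
    label: str   # the higher-level command this demonstration teaches
    level: int   # 2 for a skill, 3 for a task
    commands: List[str] = field(default_factory=list)  # commands the user spoke


def expand(commands: List[str], library: Dict[str, Demonstration]) -> List[str]:
    """Flatten a command sequence down to Level-1 commands.

    Any command that matches a recorded demonstration is replaced by that
    demonstration's own commands; everything else is assumed to be Level-1.
    """
    flat: List[str] = []
    for command in commands:
        if command in library:
            flat.extend(expand(library[command].commands, library))
        else:
            flat.append(command)
    return flat


# A Level-2 skill collected with Level-1 commands only (illustrative strings).
library = {
    "open the top drawer": Demonstration(
        label="open the top drawer",
        level=2,
        commands=["move left", "close gripper", "move backward"],
    )
}

# A Level-3 task collected with a mix of Level-2 and Level-1 commands.
task = Demonstration(
    label="open the top drawer and put the block inside",
    level=3,
    commands=["open the top drawer", "move up", "open gripper"],
)

level1_sequence = expand(task.commands, library)
```

The point of the sketch is only that a Level-3 demonstration can mix Level-2 and Level-1 commands, and that every demonstration ultimately bottoms out in commands the factory model already understands.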

Paper Structure (22 sections, 8 figures, 9 tables)

INTRODUCTION
RELATED WORK
Learning from Demonstrations
Demonstration Data Acquisition
Reasoning via Large Vision-Language Models
Framework
Framework Architecture
Model Levels
Environments
Level-1 Factory Model
Collecting Demonstrations with Language
Home Model Training
Experiments & Results
Learning Level-1 Primitive Actions
Language Augmentation
...and 7 more sections

Figures (8)

Figure 1: A Level-1 factory model is trained with a diverse set of primitive action commands. End users train the robot in their homes to complete the tasks they care about by using the Level-1 commands to collect demonstrations. We call the resulting model the home model.
Figure 2: Our architecture includes a command classifier which determines whether to run an observation-dependent or observation-independent model. The factory model and home model include both models. The observation-dependent model is fine-tuned in the home model.
Figure 3: The Level-1 factory model is trained on scripted demonstrations to perform primitive actions from language commands. An end user trains the robot to perform Level-2 skills in their home by using the Level-1 action commands to collect demonstrations of desired skills. They can then train Level-3 tasks by utilizing Level-1 action commands and Level-2 skill commands to collect demonstrations of desired tasks. These demonstrations collected by the end user only use natural language; no programming or special hardware is required. Different end users may choose to train different skills and tasks according to their needs.
Figure 4: Primitive motion (Level-1) commands are used to collect a skill (Level-2) demonstration of sweeping dust into the large dustpan.
Figure 5: The prompt template shown above is used to query the VLM for the next actions. Every action proposed is executed by the policy for 8 steps. The images are updated after every execution.
...and 3 more figures
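
The query loop described in the Figure 5 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `propose`, `execute_step`, and `observe` are hypothetical callables standing in for the VLM query, a single policy step, and image capture, and the empty-list stop signal and `max_queries` cap are also assumptions. Only the 8 policy steps per action and the image refresh after each execution come from the caption.

```python
from typing import Callable, List

STEPS_PER_ACTION = 8  # each proposed action runs for 8 policy steps (Figure 5)


def run_vlm_loop(
    task: str,
    propose: Callable[[str, List[object]], List[str]],  # hypothetical VLM query
    execute_step: Callable[[str], None],                # hypothetical policy step
    observe: Callable[[], List[object]],                # hypothetical image capture
    max_queries: int = 10,                              # assumed safety cap
) -> List[str]:
    """Alternate between querying the VLM and executing its proposed actions."""
    images = observe()
    executed: List[str] = []
    for _ in range(max_queries):
        actions = propose(task, images)
        if not actions:  # assumption: an empty proposal means the task is done
            break
        for action in actions:
            for _ in range(STEPS_PER_ACTION):
                execute_step(action)
            executed.append(action)
            images = observe()  # images are updated after every execution
    return executed
```

Passing the three components in as callables keeps the loop independent of any particular VLM client or simulator, so it can be exercised with simple stubs.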
